# <font color = 'blue'>Casper Market Segmentation</font>

## <font color = 'red'> Business Problem: </font>
+ What do our clients look like? (goal narrowed down through the clarifying [questions](#questions))
    + What groups have we targeted successfully? (Made money from in the past)
    + How can we identify groups to target in future marketing campaigns? (Make money from in the future)    


## <font color = 'red'> What I know: </font>
+ Have to ask [questions](#questions) to figure out what data they have.

# Table of Contents
1. Step 1 [Clarifying Questions](#questions)
2. Step 2 [Brainstorming/Big Picture](#brainstorm)
3. Step 3 [Plan of Action](#plan)
4. Step 4 [Present Results & Future Directions](#results)
5. Quick [K-Means](#kmeans) overview
6. [Coding Challenge](#code)

# <font color ='blue'>Step 1: Clarifying Questions</font> <a class="anchor" id="questions"></a>

1. Why do you want to know this information? What are your goals?
    + <strong>Ans</strong>: Identify groups of customers we have historically sold well to, and identify which customers to target in the future and how.
2. Do you have Point of Sales data? 
    + <strong>Ans</strong>: Yes!
3. Do you have demographic data?
    + <strong>Ans</strong>: We could probably get that through third-party providers.
4. Do you have marketing data?
    + <strong>Ans</strong>: Yes!
5. Is there anything else I need to know?
6. Have you tried anything in the past? What's worked well and what has not?
7. Is there anything larger that you are building for in the future? Is there a certain way we should steer this project or present the results?
8. Is there a timeline for this project?

# <font color ='blue'>Step 2: Brainstorming/Big Picture</font><a class="anchor" id="brainstorm"></a>
1. Casper likely attracts a certain type of customer: someone who is comfortable purchasing items on the internet.
    + Maybe this customer fits a certain profile (lives on the coasts/big cities, young, shops at Amazon, etc)
    + Maybe there are multiple subgroups of people who are likely to buy our product (clusters)
    + Maybe there are some outlier customers who we are surprised to see purchasing our mattresses -- WHAT HAPPENED?
2. How can we attract more "outlier customers"?
    + Can we judge from point of sale data or marketing data what encouraged them to check Casper out and ultimately purchase a mattress?
    + Channel attribution (personal referral, subway advertisement, unboxing video online, Google search). What led them directly to us?
3. What demographics have we not tapped into at all? What lines are they divided by (geographical, age)?
    + Maybe a certain age group was well represented in NYC but not in Chicago. What sort of campaigns did we run in NYC that we could easily transfer to Chicago? (Maybe it was a marketing exposure problem.)
4. What are the differences between customers who return products and those who keep them? 
    + How can we adjust our marketing, product, customer service to better fit the needs of these customers?
    
    

# <font color ='blue'>Step 3: Plan of Action</font><a class="anchor" id="plan"></a>
1. Acquire data
2. Clean data
3. Cluster Analysis for Market Segmentation:
    + **K-means!** (mutually exclusive groups)
    + Try out Hierarchical too
4. Evaluate & Iterate 
5. Name clusters & look for outliers
6. Present Results
7. Prepare for future directions

<br>
*Could try*:
+ Manually separating groups along certain lines (e.g. coastal vs. inland) and running **statistical tests** to see whether there are any significant differences between these groups (e.g. age: maybe older customers from the coast are more likely to buy than older customers inland).
+ **Time series analysis** (how have clusters changed over time? What new types of customers have we attracted and when? Have any groups of customers stopped coming to Casper? Where did they go?)
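
As a hedged illustration of the statistical-test idea above, here is a minimal sketch of a two-sided permutation test for a difference in mean age between two manually defined groups. The ages are made-up placeholders (not Casper data), and `permutation_test` is a hypothetical helper written for this example; a t-test or chi-square test might be a better fit depending on the variable.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and an approximate p-value.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the group labels and recompute the difference in means.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up ages for illustration only.
coastal_ages = [29, 31, 34, 27, 38, 30, 33, 28]
inland_ages = [45, 52, 39, 48, 41, 55, 44, 50]

observed_diff, p_value = permutation_test(coastal_ages, inland_ages)
print(f"difference in mean age: {observed_diff:.2f}, approximate p-value: {p_value:.4f}")
```

A small p-value here would only say the two hand-picked groups differ in age; it says nothing about why, so it is best read alongside the cluster analysis rather than instead of it.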


# <font color ='blue'>Step 4: Present Results & Future Directions</font><a class="anchor" id="results"></a>
1. <strong>Presentation of Results</strong>:
    + Visualization of groups
    + Quick description of groups we successfully attract (who they are and why we attract them, i.e. marketing campaigns)
    + Unveiling of groups/<i>outliers</i> we have shown little success with
    + Analysis of why some of those outlying customers have chosen Casper
<br> 
<br>
2. <strong>Future directions</strong>:
    + Actionable recommendations (which campaigns to run for which clusters/demographics)
    + A/B testing!
    + Was there anything missing from this picture? What data could be helpful in the future that we could start collecting now? 
    + What groups have we seen zero success with?

# <font color="green">K-Means Clustering Overview</font> <a class="anchor" id="kmeans"></a>

The K-Means algorithm is an unsupervised learning technique used to create k clusters of mutually exclusive groups. Every observation is assigned to one and only one group. The scientist chooses the value of k, but does so with the aid of a scree plot (elbow graph) which shows the within-cluster variance for various values of k. The within-cluster variance decreases quickly at first, then levels off, creating an elbow. Often, it is at the elbow that we choose the value of k. 
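
The elbow heuristic described above can be sketched as follows. This is a minimal example on synthetic data (three well-separated blobs generated here, not the Casper reviews). It uses scikit-learn's `inertia_` attribute, which is the within-cluster sum of squared distances, as the measure of within-cluster variance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three blobs of 50 points each (an assumption for illustration).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

ks = range(1, 7)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)  # within-cluster sum of squares

# How much the within-cluster sum of squares drops when k increases by one.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

On data like this the drop flattens sharply after k = 3, which is where the elbow would sit if `ks` were plotted against `inertias`. Real data rarely gives such a clean elbow, so the choice of k usually involves some judgment.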

Once k is chosen, the algorithm is performed. K centroids are randomly placed in the n-dimensional space of the dataset. Then, a distance measure is calculated between each observation and the k centroids. The observations are then assigned to the cluster of the closest centroid. The average of each cluster is then calculated and the centroids' positions are reassigned (i.e. they move). The process starts over and the distance between every observation and each newly placed centroid is calculated. The data is reassigned to the closest centroid cluster and the averages of each cluster are calculated again. The process continues until the groups remain unchanged. 
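
The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: `lloyd_kmeans` is a hypothetical helper, it uses Euclidean distance, and it assumes the initial centroids are k observations picked at random (one common way of "randomly placing" them).

```python
import numpy as np


def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Plain K-Means with random initial centroids (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Start from k distinct observations chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Euclidean distance from every observation to every centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged, so the groups have stabilized
        labels = new_labels
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Tiny one-dimensional example with two obvious groups.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels, centroids = lloyd_kmeans(X, k=2)
print(labels, centroids.ravel())
```

Which group ends up as cluster 0 depends on the random initialization, but the two centroids settle at the group means, 1 and 11.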

Depending on where the centroids randomly initialize, different clusters can form (i.e. the clusters converge on different local minima). So the algorithm is run multiple times and the solution that yields the smallest aggregate within-cluster variance is chosen as having the best groups. 
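
The restart strategy above can be sketched with scikit-learn by running one randomly initialized fit per seed and keeping the run with the lowest `inertia_`. The data is synthetic and the number of restarts (10) is an arbitrary choice for illustration. In practice the `n_init` parameter of `KMeans` performs these restarts internally and keeps the best run.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: four overlapping blobs (an assumption for illustration).
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0], [6.0, 6.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(40, 2)) for c in centers])

runs = []
for seed in range(10):
    # init="random" with n_init=1: a single run from one random initialization.
    model = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    runs.append((model.inertia_, seed))

# Keep the run with the smallest within-cluster sum of squares.
best_inertia, best_seed = min(runs)
print(f"best seed: {best_seed}, inertia: {best_inertia:.1f}")
```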

Visualization of the algorithm: https://en.wikipedia.org/wiki/K-means_clustering

# Coding Practice: Clustering based on review data <a class="anchor" id="code"></a>

## <font color = 'red'> Challenges: </font>
+ Limited data: 
    + only review data (name, age, city, rating, review text, hours of sleep, sleeping partners) 
    + no channel attribution, guessing on certain demographic data
+ Unsupervised Learning


## Loading libraries and data

In [1]:
#load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
#load casper review dataset (scraped from Casper website)
casper = pd.read_csv('casper.csv')
casper.head()

| | city | rating | name | partners | age | title | hours | state | review | date | verified | page |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `\nChicago, Illinois\n` | 5 | `\nJoe W\n` | Sleeps with a partner | `\n35 years old\n` | `\nGreat Mattress\n` | 8 | `\nChicago, Illinois\n` | `\nWe visited the Chicago store in Fulton Marke...` | `\nFeb 13, 2018\n` | `\n` | 1 |
| 1 | `\nBoston, Massachusetts\n` | 5 | `\nJ\n` | Sleeps solo | `\n33 years old\n` | `\nHighly Recommend\n` | 8 | `\nBoston, Massachusetts\n` | `\nIve had my previous mattress for YEARS (WAY ...` | `\nFeb 13, 2018\n` | `\n` | 1 |
| 2 | `\nOwasso, Oklahoma\n` | 4 | `\nDavid Alessi\n` | Sleeps solo | `\n70 years old\n` | `\nI like it but am still getting used to it. \n` | 7 | `\nOwasso, Oklahoma\n` | `\nThe unboxing was a challenge but we managed....` | `\nFeb 13, 2018\n` | `\n` | 1 |
| 3 | `\nBrownville, Maine\n` | 3 | `\nJeanne Hamlin\n` | Sleeps with a partner | `\n66 years old\n` | `\nokay\n` | 8 | `\nBrownville, Maine\n` | `\nThe un-boxing was interesting. I was disapp...` | `\nFeb 13, 2018\n` | `\n` | 1 |
| 4 | `\nLive Oak, Texas\n` | 5 | `\nLETTY DELOACH\n` | Sleeps with a partner plus a cat | `\n37 years old\n` | `\nFinally found THE mattress!\n` | 8 | `\nLive Oak, Texas\n` | `\nI absolutely love my Casper mattress! I was...` | `\nFeb 12, 2018\n` | `\n` | 1 |


## Cleaning

In [3]:
#remove whitespace
casper = casper.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

In [4]:
#drop verified column
casper.drop(columns='verified', inplace=True)

In [5]:
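#fix city, state, age
#age: keep only the number from strings like "35 years old"
casper['age'] = casper.age.str.split().str[0]
#the state column holds "City, State"; split it and strip the space left after the comma
location = casper.state.str.split(',')
casper['city'] = location.str[0].str.strip()
casper['state'] = location.str[-1].str.strip()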

In [6]:
#inspect shape
casper.shape

(7777, 11)

In [7]:
#re-save to csv
casper.to_csv('casper2.csv')

In [8]:
#re-order columns
cols = ['date', 'name', 'age', 'city', 'state', 'rating', 'title', 'review', 'hours', 'partners', 'page']
casper = casper[cols]

In [9]:
#investigate
casper.dtypes

date        object
name        object
age         object
city        object
state       object
rating       int64
title       object
review      object
hours        int64
partners    object
page         int64
dtype: object

In [10]:
#convert date to datetime
# pd.to_datetime(casper['date'][3840:3850], errors="ignore")

#find error
months = [item[0] for item in casper.date.str.split()]
set(months)

{'Apr',
 'Aug',
 'Avr',
 'Dec',
 'Feb',
 'Jan',
 'Jul',
 'Jun',
 'Mai',
 'Mar',
 'May',
 'Nov',
 'Oct',
 'Sep'}

In [11]:
#inspect if Mai is supposed to be May or Mar
# casper[casper.date.str.contains('Mai').apply(np.sum)==1] #gives index
casper.date[7140:7250]

7140    May 20, 2015
7141    Mai 20, 2015
7142    Mai 19, 2015
7143    May 18, 2015
7144    May 18, 2015
7145    Mai 18, 2015
7146    Mai 18, 2015
7147    Mai 18, 2015
7148    May 18, 2015
7149    Mai 18, 2015
7150    Mai 18, 2015
7151    Mai 18, 2015
7152    Mai 17, 2015
7153    Mai 17, 2015
7154    May 17, 2015
7155    Mai 17, 2015
7156    Mai 17, 2015
7157    Mai 17, 2015
7158    Mai 16, 2015
7159    May 16, 2015
7160    Mai 16, 2015
7161    Mai 16, 2015
7162    Mai 16, 2015
7163    Mai 15, 2015
7164    Mai 15, 2015
7165    Mai 15, 2015
7166    May 15, 2015
7167    May 15, 2015
7168    May 14, 2015
7169    May 14, 2015
            ...     
7220    May 07, 2015
7221    May 07, 2015
7222    May 06, 2015
7223    May 06, 2015
7224    May 06, 2015
7225    May 06, 2015
7226    May 06, 2015
7227    May 06, 2015
7228    Mai 06, 2015
7229    May 06, 2015
7230    Mai 06, 2015
7231    Mai 06, 2015
7232    Mai 06, 2015
7233    Mai 06, 2015
7234    Mai 06, 2015
7235    May 06, 2015
7236    Mai 0

In [12]:
#replace French month abbreviations and convert to datetime
casper['date'] = casper['date'].str.replace('Mai', 'May')
casper['date'] = casper['date'].str.replace('Avr', 'Apr')
#errors="coerce" turns any remaining unparseable date into NaT instead of leaving strings behind
casper['date'] = pd.to_datetime(casper['date'], errors="coerce")
casper.dtypes

date        datetime64[ns]
name                object
age                 object
city                object
state               object
rating               int64
title               object
review              object
hours                int64
partners            object
page                 int64
dtype: object
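
As a hedged variation on the date fix above, the French abbreviations can be kept in one mapping and the parse made strict with an explicit format, so that any abbreviation missed by the mapping shows up as `NaT` and can be counted. The sample dates below are made up to mirror the format seen in the data, and the mapping contains only the two abbreviations found above.

```python
import pandas as pd

# Made-up sample in the same "Mon DD, YYYY" format as the review dates.
dates = pd.Series(["May 20, 2015", "Mai 19, 2015", "Avr 02, 2016", "Feb 13, 2018"])

# Only the French abbreviations observed in the month set above.
french_to_english = {"Mai": "May", "Avr": "Apr"}

cleaned = dates
for french, english in french_to_english.items():
    cleaned = cleaned.str.replace(french, english, regex=False)

# An explicit format plus errors="coerce" marks anything unexpected as NaT.
parsed = pd.to_datetime(cleaned, format="%b %d, %Y", errors="coerce")
print(parsed)
print("unparsed rows:", parsed.isna().sum())
```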

In [13]:
#convert age to int
casper.age = casper.age.str.replace('<', '')
casper.age = pd.to_numeric(casper.age)
casper.dtypes

date        datetime64[ns]
name                object
age                float64
city                object
state               object
rating               int64
title               object
review              object
hours                int64
partners            object
page                 int64
dtype: object

In [14]:
#inspect partners data
set(casper.partners)

{'Dort  plusieurs',
 'Dort  plusieurs avec un chat',
 'Dort  plusieurs avec un chien',
 'Dort  plusieurs avec un cochon',
 'Dort avec un chien',
 'Dort avec un(e) partenaire',
 'Dort avec un(e) partenaire et un chat',
 'Dort avec un(e) partenaire et un chien',
 'Dort seul',
 'Dort seul avec un chat',
 'Dort seul(e) avec un chien',
 'Dort seul(e) avec un cochon',
 'Sleeps solo',
 'Sleeps solo plus a cat',
 'Sleeps solo plus a dog',
 'Sleeps solo plus a pig',
 'Sleeps with a cat',
 'Sleeps with a dog',
 'Sleeps with a partner',
 'Sleeps with a partner plus a cat',
 'Sleeps with a partner plus a dog',
 'Sleeps with a partner plus a pig',
 'Sleeps with multiple partners',
 'Sleeps with multiple partners plus a cat',
 'Sleeps with multiple partners plus a dog',
 'Sleeps with multiple partners plus a pig'}

In [15]:
#split up partners info (include French)
casper['multiple_partners'] = (casper.partners.str.contains('plusieur') | 
                               casper.partners.str.contains('multiple'))
casper['single_partner'] = (casper.partners.str.contains('partenaire') | 
                            casper.partners.str.contains('a partner'))
casper['solo'] = (casper.partners.str.contains('seul') | 
                  casper.partners.str.contains('Sleeps with a cat') | 
                  casper.partners.str.contains('Sleeps with a dog') | 
                  casper.partners.str.contains('solo') |
                  casper.partners.str.contains('Dort avec un chien'))
#'cochon' is French for pig
casper['pets'] = (casper.partners.str.contains('cat') |
                  casper.partners.str.contains('chat') |
                  casper.partners.str.contains('dog') |
                  casper.partners.str.contains('chien') |
                  casper.partners.str.contains('pig') |
                  casper.partners.str.contains('cochon'))
casper.head(3)

| | date | name | age | city | state | rating | title | review | hours | partners | page | multiple_partners | single_partner | solo | pets |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-02-13 | Joe W | 35.0 | Chicago | Illinois | 5 | Great Mattress | We visited the Chicago store in Fulton Market.... | 8 | Sleeps with a partner | 1 | False | True | False | False |
| 1 | 2018-02-13 | J | 33.0 | Boston | Massachusetts | 5 | Highly Recommend | Ive had my previous mattress for YEARS (WAY ov... | 8 | Sleeps solo | 1 | False | False | True | False |
| 2 | 2018-02-13 | David Alessi | 70.0 | Owasso | Oklahoma | 4 | I like it but am still getting used to it. | The unboxing was a challenge but we managed. I... | 7 | Sleeps solo | 1 | False | False | True | False |


In [16]:
#check if sums up correctly
sum(casper.single_partner) + sum(casper.multiple_partners) + sum(casper.solo)

7777
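
The total of 7777 matches the row count, but a matching total alone does not prove that every row has exactly one flag set (a row with two flags and a row with none would cancel out). A stricter check is sketched below on a small made-up sample of the partner strings listed earlier. The patterns mirror the ones used above, combined with `|` as regex alternation.

```python
import pandas as pd

# A made-up sample of the partner descriptions seen in the data.
partners = pd.Series([
    "Sleeps solo",
    "Sleeps with a partner plus a cat",
    "Sleeps with multiple partners",
    "Dort  plusieurs avec un chat",
    "Dort avec un(e) partenaire",
    "Dort seul(e) avec un chien",
    "Sleeps with a dog",
    "Sleeps with multiple partners plus a cat",
])

flags = pd.DataFrame({
    "multiple_partners": partners.str.contains("plusieur|multiple"),
    "single_partner": partners.str.contains("partenaire|a partner"),
    "solo": partners.str.contains(
        "seul|solo|Sleeps with a cat|Sleeps with a dog|Dort avec un chien"
    ),
})

# Each row should have exactly one of the three flags set.
flags_per_row = flags.sum(axis=1)
print(flags_per_row.value_counts())
```

If any row shows 0 or 2 flags, the string patterns overlap or miss a category, which is easy to overlook when only the grand total is compared.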

In [17]:
#add french column
casper['french'] = casper.partners.str.contains('Dort')

In [18]:
#re-save to csv
casper.to_csv('casper2.csv')

## Model: K-Means Clustering

In [20]:
casper = pd.read_csv('casper2.csv')
casper.dtypes

Unnamed: 0             int64
date                  object
name                  object
age                  float64
city                  object
state                 object
rating                 int64
title                 object
review                object
hours                  int64
partners              object
page                   int64
multiple_partners       bool
single_partner          bool
solo                    bool
pets                    bool
french                  bool
dtype: object

In [26]:
#select features, drop rows with missing values, and cast everything to int
#chaining returns a new DataFrame, which avoids the SettingWithCopyWarning; astype has no inplace option
features = ['age', 'rating', 'hours', 'multiple_partners', 'single_partner', 'solo', 'pets', 'french']
casper_subset = casper[features].dropna().astype(int)
casper_subset.head(10)


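| | age | rating | hours | multiple_partners | single_partner | solo | pets | french |
|---|---|---|---|---|---|---|---|---|
| 0 | 35 | 5 | 8 | 0 | 1 | 0 | 0 | 0 |
| 1 | 33 | 5 | 8 | 0 | 0 | 1 | 0 | 0 |
| 2 | 70 | 4 | 7 | 0 | 0 | 1 | 0 | 0 |
| 3 | 66 | 3 | 8 | 0 | 1 | 0 | 0 | 0 |
| 4 | 37 | 5 | 8 | 0 | 1 | 0 | 1 | 0 |
| 5 | 70 | 5 | 5 | 0 | 1 | 0 | 0 | 0 |
| 6 | 25 | 5 | 6 | 0 | 0 | 1 | 0 | 0 |
| 7 | 29 | 5 | 9 | 0 | 1 | 0 | 1 | 0 |
| 8 | 31 | 5 | 7 | 0 | 1 | 0 | 0 | 0 |
| 9 | 45 | 5 | 7 | 0 | 0 | 1 | 0 | 0 |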


In [27]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
kmeans.fit(casper_subset)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
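
One caveat with the fit above: K-Means relies on Euclidean distance, so a feature measured in tens of years (age) will likely dominate features that range from 0 to 1 or 1 to 5. A possible next step, sketched here on randomly generated stand-in data rather than the real reviews, is to standardize the features first and then profile each cluster by its mean values. The column names mirror a few of those in `casper_subset`; whether standardizing the 0/1 flags is appropriate is a judgment call.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Random stand-in data with the same kinds of columns (not the Casper reviews).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(20, 75, size=n),
    "rating": rng.integers(1, 6, size=n),
    "hours": rng.integers(4, 10, size=n),
    "pets": rng.integers(0, 2, size=n),
})

# Put every feature on a comparable scale (mean 0, standard deviation 1).
scaled = StandardScaler().fit_transform(toy)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

# Profile the clusters in the original units to help name them.
toy["cluster"] = kmeans.labels_
print(toy.groupby("cluster").mean().round(2))
```

The per-cluster means in original units are what would feed the "name clusters" step in the plan of action.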