# 1. IMPORTING DATA AND LOADING NECESSARY LIBRARIES

In [None]:
# importing libraries

#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 
pd.set_option('display.max_columns', None)                              
pd.set_option('display.max_colwidth', None)                           
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt 
plt.style.use('dark_background')
import seaborn as sns                                              
sns.set(style='whitegrid')
sns.set_palette('dark')   # color_palette() only returns a palette; set_palette() actually applies it
%matplotlib inline
#------------------------------------------------------------------------------
from sklearn.preprocessing import StandardScaler,LabelEncoder
from sklearn.cluster import KMeans
from scipy.stats import zscore
from scipy.spatial import distance
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer

In [None]:
df = pd.read_csv('renttherunway.csv', index_col=False)

# 2. DATA CLEANSING AND EXPLORATORY DATA ANALYSIS

In [None]:
df.info()

In [None]:
#defining a function
def ifDuplicateSamples(data):
  NoOfDuplicateRows = data.duplicated().sum()
  if NoOfDuplicateRows == 0:
    print("There are no duplicate rows")
    return
  else:
    print("There are ",NoOfDuplicateRows,"duplicate rows")

#running it on current data
ifDuplicateSamples(df)

In [None]:
df.drop_duplicates(inplace=True)
df.duplicated().sum()

- Let's remove the unnecessary columns, such as the free-text fields. I could have run sentiment analysis on them to extract something useful, but I will limit the scope of this project.

In [None]:
df.drop(columns= ['Unnamed: 0','review_date', 'review_text','review_summary'],inplace=True)

In [None]:
#checking the unique users and items
print("Number of unique users: {user:,}".format(user=df["user_id"].nunique()))
print("Number of unique items: {item:,}".format(item=df["item_id"].nunique()))

- There are 192,544 samples (transactions) in total, made by 105,571 unique users.
- So, on average, one user makes about two transactions. But we can be a little more specific and see how many make only one transaction.
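One way to answer the "how many make only one transaction" question is to count transactions per user and then count the users whose total is exactly one. A minimal sketch on made-up IDs (the `user_ids` Series is a hypothetical stand-in for `df['user_id']`):

```python
import pandas as pd

# Toy stand-in for the user_id column; the values are invented for illustration.
user_ids = pd.Series([101, 102, 102, 103, 103, 103, 104])

# Number of transactions made by each user.
tx_per_user = user_ids.value_counts()

# Users with exactly one transaction, and their share among all users.
single_tx_users = int((tx_per_user == 1).sum())
single_tx_share = single_tx_users / tx_per_user.size

print(single_tx_users, single_tx_share)
```

On the real data the same two lines would run on `df['user_id'].value_counts()`.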

In [None]:
df['user_id'].value_counts()

- The customer with user_id '691468' has made 436 transactions.
- There are many such users with an unexpectedly high number of transactions.
- Let us dig a little deeper.

In [None]:
df[df['user_id'] == 691468].describe(include='all').T

- Personal bodily attributes like weight and height do not change across her transactions. They are all the same, as expected. But the other details differ based on the product she rented.

IMPORTANT ASSUMPTION:

- I am going to make an important assumption now.
- I am mainly concerned with customer segmentation. I have data for every customer, but there are many customers who make far more transactions than average (some ranging from 200 to 400). I will not aggregate them.
- I will treat each transaction as a separate customer. This is an important ASSUMPTION that needs to be stated beforehand.
- Keeping this in mind, I am going to remove attributes like 'user_id' and 'item_id', which do not offer much to a single transaction.

In [None]:
df.drop(columns=['user_id','item_id'],inplace=True)

- Let's rename a few features for ease of access in pandas.
- I also make sure that 'bust size' is renamed to 'bra_size', which is what the column values stand for.

In [None]:
df.rename(columns={'bust size':'bra_size','rented for':'rented_for','body type':'body_type'}, inplace=True)

### Dealing with missing values

In [None]:
def missing_data(df):
  missing_data = pd.DataFrame({'net_missing': df.isnull().sum(), '%missing': (df.isnull().sum()/len(df))*100})
  print(missing_data)

missing_data(df)

- There are plenty of missing values in our data.
- We need to do a little EDA to get a sense of the data before we can go ahead with missing value imputation.
- An idea of the outliers will also be helpful. Let us observe the data manually first.

In [None]:
df.describe().T

- There are three numerical features. Let's create a list for them.

**rating**
- rating goes from 2 to 10.
- It has a few missing values. We will impute them based on the data.
- I would have liked to infer them from the review text, but there are not many missing values.
- I don't think there are outliers.

**size**
- It is the standardized size of the product
- size ranges from 0 to 58 
- mean is almost equal to median, suggesting a roughly symmetric distribution.
- It was chosen by the customer, but labelled by the company.
- As it's a rental company, customers don't have the freedom to 'change' their size. They can only 'choose' it.
- So, all sizes are valid. This column looks consistent, as it should.
- But we will still check for outliers.

**age**
- age ranges from 0 to 117, which is obviously unreasonable.
- I will first check the customers who have put their age less than 15, and then those who have over 90, and check other values.
- It is reasonable to assume that some customers are private and they do not wish to share their details. But it still doesn't make sense for one to put their age as 117.


*Let me analyze the data feature by feature.*

### age

- First I will deal with 'age'.
- Clearly there are too many outliers in this column, as I had expected. Let's first check the customers who put their age at 100 or more.

In [None]:
plt.figure(figsize=(14,7))
sns.boxplot(x=df['age'])
plt.title("Boxplot of age", fontsize=15)
plt.show()

- age ranges from 0 to 116, which is clearly bogus.
- I will first check for ages over 100

In [None]:
df.groupby('age').rating.mean()

In [None]:
df[df.age >= 100].describe().T

- What is funny about this subset is that almost all the ratings are either 10 or 8. I think these are fake entries put in to improve the ratings.
- I will remove all these rows from the dataset.

In [None]:
df.drop(df[df['age'] >= 100].index, inplace = True)

- Let's check the subset of ages 65 and over also, just to be sure.

In [None]:
df[df.age >= 65].describe().T

- The mean rating is 9.14, which is around the mean of the whole dataset.
- So we won't do much. Though it is a little surprising that customers older than 65 would care to rent outfits from an online store, considering the demographic at hand.
- Let's check a few samples.

In [None]:
df[df.age >= 65].head(10)

- To me, this data does not look plausible. Almost all of them have given a rating of 10.
- Also, there are not many samples belonging to this subset. I will remove them without any guilt.

In [None]:
df.drop(df[df['age'] >= 65].index, inplace = True)

- Now let's check for ages less than 10

In [None]:
df[df.age <= 10].sample(15)

In [None]:
df[df.age <= 10].describe().T

- Little kids are not supposed to have a 'bra_size' of 32 or 38.
- Almost all are rated 10.
- They should be removed.
- Let's check ages 10 to 16. Ideally we should remove them before looking, because the company only has options for girls characterized as teens, which means over 15 at the extreme.

In [None]:
df[(df.age >= 10) & (df.age <= 16)].sample(15)

In [None]:
df[(df.age >= 10) & (df.age <= 16)].describe().T

- Similar trend of suspiciously high ratings. I will remove them.
- Also, not many samples

In [None]:
df.drop(df[df['age'] <= 16].index, inplace = True)

- Let's be sure, and check the data between 16 and 18 once.

In [None]:
df[(df.age > 16) & (df.age <= 18)].describe().T

- Customers of this age segment have given a mean rating of 9.3, which is still considerable. But I will keep them.
- Fake customers can be attacked in a different way. Let's check the boxplot before moving away.

In [None]:
df['age'].describe()

In [None]:
plt.figure(figsize=(14,7))
sns.boxplot(x=df['age'])
plt.title("Boxplot of age after first outlier treatment", fontsize=15)
plt.show()

- There are a few outliers. But they pertain to the uniqueness of data, as it tries to capture the entire population, females in focus.
- Let's do a log transformation, before going ahead. 
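The log transformation mentioned above could look like the following. This is only a sketch on made-up ages; using `np.log1p` (and `np.expm1` to undo it) is my assumption, since the notebook does not say which form of the transformation it has in mind.

```python
import numpy as np
import pandas as pd

# Toy ages, invented for illustration (the real column is df['age']).
ages = pd.Series([18, 25, 30, 32, 35, 41, 58, 64], name="age")

# log1p computes log(1 + x); it compresses the upper tail of the distribution.
age_log = np.log1p(ages)

# expm1 is the exact inverse, so the original scale can be recovered later.
age_back = np.expm1(age_log)

print(age_log.round(3).tolist())
```

The transformed values keep the same ordering as the raw ages, so the boxplot can simply be redrawn on `age_log` to see whether the high-end points still stand out.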

### size

- Now, let's see 'size' once again. It is the 'standardized size of the product' according to the data dictionary. So clearly, it is something printed on the products and catalogued, and filled into the database based on the product rented by a particular customer.
- In America, where the company is located, the sizes range from 0 to 22, with no odd values. That is also the case on the renttherunway website.
- But we observed that size values range up to 58 in this dataset. This is quite confusing.
- So I look at its countplot to check the counts of the possible sizes in this dataset.

In [None]:
plt.figure(figsize=(14,7))
sns.countplot(data=df,x='size', palette='pastel')
plt.show()

- Too many customers have odd sizes. So, either the data has already been processed and the values have been given a different range, or there is something inherently wrong with it.
- The odd values cannot be explained even if we consider that the sizes are a mix of American and European dress sizes.
- Let me still have a glance at the customers who have a size over 28.
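To put numbers on "too many odd sizes", one could compute the share of odd values and the count of values above the US range of 0 to 22 mentioned earlier. A small sketch on invented sizes (the `sizes` Series is a hypothetical stand-in for `df['size']`):

```python
import pandas as pd

# Toy sizes, invented for illustration.
sizes = pd.Series([4, 8, 9, 12, 13, 20, 45, 51])

# Share of odd-valued sizes (the mean of a boolean mask is a proportion).
odd_share = (sizes % 2 == 1).mean()

# Number of sizes above the usual US dress-size ceiling of 22.
above_us_range = int((sizes > 22).sum())

print(odd_share, above_us_range)
```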

In [None]:
df[df['size']>28].describe().T

- There are over 6000 customers who have a size greater than 28. A significant number.
- Most of them have rated highly (mean=9).
- age as well as size appear to be roughly normally distributed.
- I don't think I am going to do anything about it. It requires the intervention of an expert who can explain what these values mean.
- For now, I am going to take this as it is.
- I may decide to drop it after bivariate analysis, because the size of the product is a property of the product. It is directly related to the customer through their body size (as in height, weight, body type, and bra size).
- I will explore these relationships and decide on this feature.

### rating

- Let me convert rating scale to 1-5: There are only even numbers (i.e. 2, 4, 6, 8, 10) so condensing it to this scale seems reasonable.
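Since the ratings only take the even values 2 to 10, halving them is one straightforward way to get the 1 to 5 scale. A minimal sketch on toy values; it is my reading of the conversion described above, not code taken from the notebook:

```python
import pandas as pd

# Toy ratings on the original 2-10 scale, with one missing value.
ratings_10 = pd.Series([2, 4, 6, 8, 10, None])

# Halving maps 2, 4, 6, 8, 10 to 1, 2, 3, 4, 5; missing values stay missing.
ratings_5 = ratings_10 / 2

print(ratings_5.tolist())
```

Applied to the real column this would be `df['rating'] = df['rating'] / 2`, and it can be done before or after imputation because the median scales the same way.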

In [None]:
df[['age','rating']].describe().T

- Before going further, I will impute age and rating with their medians (which are almost the same as their means) because I wish to keep them as whole numbers.
- The median is an appropriate value to fill in place of the nulls.
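As a quick illustration of median imputation, here is a sketch on made-up ages. The assignment form `s = s.fillna(...)` is used because it behaves the same across pandas versions:

```python
import pandas as pd

# Toy ages with two missing entries, invented for illustration.
age = pd.Series([25, 30, None, 32, 41, None, 29])

# The median ignores missing values by default.
median_age = age.median()

# Fill the gaps and assign the result back.
age_filled = age.fillna(median_age)

print(median_age, age_filled.tolist())
```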

In [None]:
# assign the result back; in-place fillna on a selected column is unreliable in recent pandas
df['age'] = df['age'].fillna(df['age'].median())
df['rating'] = df['rating'].fillna(df['rating'].median())

### height and weight

- Let's focus on the weight and height of the customers.
- First we will remove the 'lbs' part from the weight values, keeping in mind that around 15% of them are missing.
- Then we will convert the height from feet and inches to inches, keeping the values in inches because it's an American dataset.

In [None]:
# In the following command, I have extracted all the numerics (0-9) using extract method of string object
#we also convert it into numeric by using pandas' to_numeric method
df["weight"] = df["weight"].str.extract("([0-9]+)", expand=True).apply(pd.to_numeric)

In [None]:
# function to convert height into inches
# `height` is the height column of the dataframe, with values of the form FT' IN"
def to_inch(height):
  # capture the feet and inch numbers in one pass; str.extract keeps the original index,
  # so rows with a missing height stay NaN instead of shifting the remaining values
  parts = height.str.extract(r"([0-9]+)\D+([0-9]+)").apply(pd.to_numeric)
  # column 0 holds the feet, column 1 holds the inches
  return parts[0] * 12 + parts[1]

#running the function on the height column
df['height'] = to_inch(df['height'])

- Now let us check the height and weight values to see if they make sense.
- First we check their boxplots.

In [None]:
plt.figure(figsize=(10,7))
df[['height','weight']].boxplot()
plt.title("Boxplots of Height and Weight", fontsize=15)
plt.show()

In [None]:
df[['height','weight','age']].describe().T

### weight

- From the boxplots, it looked like there are too many outliers in the weight column. But the mean is almost equal to the median, pointing towards a typical distribution of weights in a sample of the population.
- The average weight of the customers is 137 pounds, and the weights range from 50 to 300 pounds, with a significant number of people lying in the range 135 to 148 (as seen from the third quartile).
- Everything looks consistent to me.
- And I will impute the null values with the median.

In [None]:
df['weight'] = df['weight'].fillna(df['weight'].median())

### height

- height ranges from 54 to 78 inches, which seems reasonable for a sample of the human population.
- The mean height of the given set of customers is almost the same as the median, suggesting a roughly symmetric distribution.
- I will impute the null values with the median.

In [None]:
df['height'] = df['height'].fillna(df['height'].median())

- We have seen the numeric features.
- Let's focus on the categorical ones.

### bra_size

- First try to understand how many types of values are present

In [None]:
df['bra_size'].value_counts()

- There are 106 distinct bra sizes in the customer dataset we have.
- Each bra size, as the nomenclature goes, is given by two numeric digits followed by one or more letters.
- The numeric digits correspond to the band size, and the letters correspond to the cup size.
- Now, looking at the data we have, most of the customers have a band size of 34, which is the case worldwide. So, the data values are consistent.
- We can impute the missing values with the mode of all the customer values.

In [None]:
df['bra_size'] = df['bra_size'].fillna(df['bra_size'].mode()[0])

- I am going to split this column into two different columns
  - There are too many possible classes (106)
  - The algorithms I am going to use to cluster are centroid based, and do not perform well when there are too many categorical features.
  - For now, let's continue with the other features

### body_type

- We can already see that we have to impute missing values in this column with mode.
- But let us still be sure

In [None]:
df['body_type'].value_counts()

- There are seven classes. Most of the customers chose to go for a body hugging dress as is clear from their choice of hourglass, athletic or pear. 
- We will impute the missing values with mode. 
- After proper multivariate analysis, I might choose to 'reduce' the number of classes by considering one or two classes as same. 
- Or I might even drop this feature altogether.

In [None]:
df['body_type'] = df['body_type'].fillna(df['body_type'].mode()[0])

Let us perform mode imputation with rented_for as well.

In [None]:
df['rented_for'] = df['rented_for'].fillna(df['rented_for'].mode()[0])

In [None]:
missing_data(df)

Thus, missing value imputation has been done.

- Now, we are ready to go ahead with EDA, feature engineering and processing.
- I will save the data in a new dataframe, and continue with the process on a separate page to make the entire thing more readable, and to avoid refreshing and computing the values again and again.

In [None]:
df1 = df.copy()   # an independent copy, so later changes to df1 do not alter df

In [None]:
df.to_csv('cleaned1_renttherunway.csv')

### bra_size

- We saw that this feature has too many classes. So clustering would be a bit difficult on such complex data.
- Also, bra size is actually a combination of two quantities. One is the band size, represented by the number, and the other is the cup size, represented by the letters.
- So, we can have two features representing this feature. And that's what I am going to do. I will split this feature.
- As in the case of weight, I will extract the numerals and the letters using the basic regex method 'extract'.
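Before running the split on the full column, here is what the two regex extractions do on a few toy values (the strings below are invented to mimic the 'band + cup' format):

```python
import pandas as pd

# Toy bra sizes in the same 'digits followed by letters' format.
bra = pd.Series(["34b", "36ddd/e", "32aa"])

# First run of digits -> band size, converted to a number.
band = pd.to_numeric(bra.str.extract(r"([0-9]+)", expand=False))

# First run of lowercase letters -> cup size.
cup = bra.str.extract(r"([a-z]+)", expand=False)

print(band.tolist(), cup.tolist())
```

Note that `extract` returns only the first match, so a value such as '36ddd/e' yields 'ddd' and the alternative spelling after the slash is ignored.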

In [None]:
df1["band_size"] = df1["bra_size"].str.extract("([0-9]+)", expand=True).apply(pd.to_numeric)

In [None]:
df1["cup_size"] = df1["bra_size"].str.extract("([a-z]+)", expand=True)

In [None]:
df1.drop(columns=['bra_size'],inplace=True)

### band_size

In [None]:
df1['band_size'].describe()

- There aren't many possible values

In [None]:
df1['band_size'].value_counts()

- band_size ranges from 28 to 48
- We are good to go as far as these values are concerned.

### cup_size

In [None]:
df1['cup_size'].value_counts()

- These data points have been filled in by the customers.
- So some are 'ddd' while some are 'f', but both mean the same.
- A shallow dive into the problem from the business perspective brings out the fact that
    - dd is e, and ddd is f
    - aa is something that comes before a in terms of size.
- Let's make some replacements

In [None]:
df1['cup_size'] = df1['cup_size'].replace({'dd':'e','ddd':'f'})

- Now, I will label encode. I know that LabelEncoder() encodes data on the basis of alphabetical ordering. So it will place 'a' before 'aa' by default, which will be out of order.
- So, before label encoding I will rename 'a' as 'ab', so that the encoding goes exactly as I want.
- After label encoding, the symbols will be replaced by integer codes that stand in for the cup size they represent (roughly, in inches).
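An alternative to the renaming trick is to spell out the intended order and map each cup letter to its position. This is a sketch under my own assumptions: the `cup_order` list is hypothetical and would need to cover every value actually present in the column.

```python
import pandas as pd

# Toy cup sizes, invented for illustration.
cup = pd.Series(["aa", "a", "b", "dd", "ddd", "f", "c"])
cup = cup.replace({"dd": "e", "ddd": "f"})

# Plain alphabetical sorting puts 'a' before 'aa', which is the problem described above.
alphabetical = sorted(set(cup))

# Explicit smallest-to-largest order (assumed; extend it if other letters occur).
cup_order = ["aa", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
cup_code = cup.map({label: i for i, label in enumerate(cup_order)})

print(alphabetical)
print(cup_code.tolist())
```

Any value missing from `cup_order` would come out as NaN, which makes unexpected categories easy to spot.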

In [None]:
df1['cup_size'] = df1['cup_size'].replace({'a':'ab'})

In [None]:
le = LabelEncoder()
df1['cup_size'] = le.fit_transform(df1['cup_size'])
df1['cup_size'].value_counts()

- From cup_size, we will derive the bust size.
- From https://www.macys.com/p/bra-fit-guide/bra-size-fit-faq/, we know that:
    - bust_size - band_size = cup_size
- We have band and cup size, so we will have bust_size
    - bust_size = band_size + cup_size

In [None]:
df1['bust_size'] = df1['band_size'] + df1['cup_size']

In [None]:
df1['bust_size'].describe()

In [None]:
df1.drop(columns=['cup_size'],inplace=True)

- Now, I think that for a woman her bust and band size must have correlation.
- Let's check.

In [None]:
print(f"The correlation between band and bust size is: ")
print(np.corrcoef(df1['band_size'], df1['bust_size']))

- This is a very high correlation.

In [None]:
sns.regplot(data=df1,x='band_size',y='bust_size')

- A very interesting plot indeed.
- As expected, for one value of band_size, there are multiple values of bust_size.
- To preserve the information contained in both these variables in a single feature, I am going to take their element-wise product and save it in a new feature called 'chest'.


In [None]:
df1['chest'] = df1['bust_size'] * df1['band_size']
df1.drop(columns=['bust_size','band_size'],inplace=True)

In [None]:
df1.info()

In [None]:
df1['chest'].describe()

### 'fit' and 'rating'

- fit and rating appear to be connected to each other. A customer is expected to rate a product highly if it fits her well. So, from a practical perspective there is definitely a connection.
- let us draw a countplot of rating with fit as hue to visualize the connection

In [None]:
sns.countplot(data=df1,x='rating',hue='fit')
plt.show()

- As is clear from the above graph,
  - customers have rated a product highly if it fits them well
  - if the product is small or large, a lot of customers have still rated it highly
  - the count of people rating a product 1, 2 or 3 is small compared to the count of people rating it 4 or 5
  - the rating pattern for small and for large looks much the same, and somewhat lower than for fit, to the extent that one can, from a statistical perspective, merge large and small into one subset
  - Let's verify

In [None]:
pd.DataFrame(df1.groupby("fit")["rating"].value_counts())

- There is little statistical difference between the rating distributions of customers who found the product small and those who found it large.
- The counts of customers rating a product 1, 2, 3, 4 or 5 are almost the same for the classes small and large. So, we can merge them into one class, 'not fit'.
- So I am converting the feature fit into a binary one.
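Raw counts can hide similarity when the classes have different sizes, so comparing row-wise proportions is a useful companion check before merging. A sketch on invented data (the `toy` frame is hypothetical; `pd.crosstab` with `normalize="index"` gives the share of each rating within each fit class):

```python
import numpy as np
import pandas as pd

# Toy fit/rating pairs, invented for illustration.
toy = pd.DataFrame({
    "fit": ["fit", "fit", "fit", "small", "small", "large", "large", "fit"],
    "rating": [5, 5, 4, 4, 3, 4, 3, 5],
})

# Share of each rating within each fit class (each row sums to 1).
shares = pd.crosstab(toy["fit"], toy["rating"], normalize="index")

# Binary version of fit: 1 if the product fit, 0 if it was small or large.
toy["fit_binary"] = np.where(toy["fit"] == "fit", 1, 0)

print(shares)
```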


In [None]:
df1["fit"] = np.where((df1["fit"] == 'fit'), 1, 0)

In [None]:
sns.countplot(data=df1,x='rating',hue='fit')
plt.show()

In [None]:
pd.DataFrame(df1.groupby("rating")["fit"].value_counts())

- The trend continues. If the product fits, the customers are going to rate it highly, as in either 4 or 5.
- The feature 'rating' already contains the necessary information about the customer's choice.
- We do not really need the 'fit' feature in the context of customer segmentation.
- But to be sure, we will perform a Kruskal-Wallis test (as rating is not normally distributed).
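For reference, `scipy.stats.kruskal` takes one sequence per group and returns the H statistic and a p-value. A minimal sketch on invented ratings, just to show the call and how the result is read:

```python
from scipy.stats import kruskal

# Toy ratings for the two groups, invented for illustration.
ratings_not_fit = [3, 4, 3, 2, 4, 3]
ratings_fit = [5, 4, 5, 5, 4, 5]

# Null hypothesis: both groups come from the same distribution.
h_stat, p_value = kruskal(ratings_not_fit, ratings_fit)

print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

A small p-value (conventionally below 0.05) is evidence against the null hypothesis that the two groups share the same rating distribution.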

In [None]:
from scipy.stats import kruskal

In [None]:
# kruskal was imported directly above, so it is called without the `stats.` prefix
kruskal(df1['rating'][df1['fit'] == 0],
        df1['rating'][df1['fit'] == 1])

- A very high value of the H statistic and a low p-value show that we have sufficient evidence to reject the null hypothesis that the median rating is the same for both groups of fit, YES and NO.
- We can claim that the rating differs depending on whether the product fit or not.
- So the two features, 'rating' and 'fit', are dependent on each other, something we knew intuitively. A person won't rate something highly unless the item fits.
- So, I can get rid of either one of them. Or keep both of them by saving their element-wise product in a new feature.
- While multiplying the two columns, I will make sure to increment the fit values by 1.
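It can help to tabulate what the combined feature looks like for every possible input. The sketch below assumes ratings on a 1 to 5 scale and a binary fit, and lists the (rating, fit) pairs that end up with the same product:

```python
import pandas as pd

# Every combination of rating (assumed 1-5) and binary fit.
grid = pd.DataFrame(
    [(r, f) for r in range(1, 6) for f in (0, 1)],
    columns=["rating", "fit"],
)

# The combined feature: rating multiplied by (fit + 1).
grid["response"] = grid["rating"] * (grid["fit"] + 1)

# Input pairs that share a response value with another pair.
collisions = grid[grid.duplicated("response", keep=False)].sort_values("response")

print(collisions)
```

Under these assumptions, a rating of 2 with a good fit and a rating of 4 with a poor fit both map to 4, so the product does not keep the two inputs fully distinguishable. That may be acceptable for segmentation, but it is worth knowing.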

In [None]:
df1['response'] = df1['rating'] * (df1['fit']+1)

In [None]:
df1.drop(columns=['rating','fit'],inplace=True)

### rented_for

In [None]:
df1['rented_for'].value_counts()

- There are 9 classes of this feature, which is a rather huge number. 
- party and party: cocktail clearly belong to only one class- 'party'. Let's merge them

In [None]:
df1.loc[df1.rented_for=='party: cocktail','rented_for'] = 'party'

- If a customer (female, let's not forget) goes on a date and chooses 'expensive' rented clothes, she is going to wear a similar dress for a party. To clarify, I mean to argue that the occasions party and date can be considered the same, if we focus only on the dress the woman chooses to wear.
- So I am reclassifying all the values with the occasion date into party.

In [None]:
df1.loc[df1.rented_for=='date','rented_for'] = 'party'

- Now, work is basically a formal affair.
- Someone may choose to go to work in a simple tee and jeans, but they won't rent it specially from a company.
- So, as a customer of renttherunway, a woman is renting for a 'formal affair' if she is renting for work.
- So I am reclassifying all the 'work' values as formal affair.
- I am also renaming 'formal affair' as 'formal'.

In [None]:
df1['rented_for'] = df1['rented_for'].replace({'formal affair':'formal'})

In [None]:
df1.loc[df1.rented_for=='work','rented_for'] = 'formal'

In [None]:
df1['rented_for'].value_counts()

In [None]:
df1.loc[df1.rented_for=='vacation','rented_for'] = 'everyday'
df1.loc[df1.rented_for=='other','rented_for'] = 'everyday'

- Now, I have also merged 'everyday', 'other', and 'vacation' into one class.
- The reason has more to do with statistics than real-life reasoning.
- But of course, when it comes to dresses, people tend to wear similar, comfortable clothes for everyday use and for vacation.
- The counts for these classes are also low, signifying a smaller cluster. I would like to see them together. I also want to reduce the sparsity of the data.
- The idea is to prepare the data for centroid-based algorithms. If I had decided to go for DBSCAN or some neural-network-based clustering approach, I might have tackled the dataset in a different way.
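The merges above were done one class at a time; the same regrouping can be written as a single lookup table, which keeps all the decisions in one place. A sketch on toy values (the `occasion_map` name is mine, and its entries simply restate the merges described above):

```python
import pandas as pd

# One mapping that restates every merge made in this section.
occasion_map = {
    "party: cocktail": "party",
    "date": "party",
    "formal affair": "formal",
    "work": "formal",
    "vacation": "everyday",
    "other": "everyday",
}

# Toy column with one example of each original class.
rented_for = pd.Series([
    "wedding", "party: cocktail", "date", "work", "formal affair",
    "vacation", "other", "everyday", "party",
])

# Values not in the mapping (wedding, party, everyday) are left unchanged.
merged = rented_for.replace(occasion_map)

print(sorted(merged.unique()))
```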

In [None]:
df1['rented_for'].value_counts()

### category

- Let's analyze 'category' feature

In [None]:
df1['category'].value_counts()

- There are 68 classes in the category of the item that customers have the option to rent, assuming they have exhausted all of them in this huge dataset.
- The range of counts of individual items is significant.
- This is going to create problems for the algorithms when they cluster the data, and more so when we are limiting ourselves to distance-based algorithms like KMeans and agglomerative clustering.
- A huge percentage of them are 'dress'.

In [None]:
dress_percent = len(df1.loc[df1['category']=='dress'])/len(df1.category)*100
print(f'The percentage of dresses rented by customers is {dress_percent:.2f}')

- In fact, most of the people either rent dress or gown or sheath.
    - dress         92560
    - gown          44160
    - sheath        19227
- In total, 155947 items belong to one of the above three categories, which is more than three-quarters of the total count of transactions.
- And it makes sense, as most people come here to rent for occasions like a wedding or a party.
- Now, I will club the rest of the classes into one, and call it 'others'.
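Rather than hard-coding the three names, the most frequent categories can also be picked from the counts themselves. A sketch on invented data, assuming the top three are the ones to keep:

```python
import pandas as pd

# Toy category column, invented for illustration.
category = pd.Series(
    ["dress"] * 5 + ["gown"] * 3 + ["sheath"] * 2 + ["romper", "top", "jumpsuit"]
)

# The three most frequent categories.
top3 = category.value_counts().nlargest(3).index

# Keep those as they are and relabel everything else as 'others'.
collapsed = category.where(category.isin(top3), "others")

print(collapsed.value_counts().to_dict())
```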

In [None]:
df1['category'] = [x if x in {'dress','gown','sheath'} else 'others' for x in df1['category']]

In [None]:
df1['category'].value_counts()

### body_type and size

- let's first see what are the various values in the column

In [None]:
df1['body_type'].value_counts()

- body_type has the body type of customer
- it is filled by the customer
- And intuitively, it is easy to see that it has strong connections with the customer's weight, height, chest and size of the item they choose to rent.
- Let's look at them a little closely.

In [None]:
df1[['body_type','size','weight','height','chest']].sample(50)

- First of all, 'size' is quite dubious. We have no idea what measure has been used, as discussed earlier. This feature is either beyond my scope (as my business understanding is limited) or there is something seriously fishy about the values in this column, which is what I strongly believe, based on my reading of the subject.
- But I do see a sort of correlation between it and weight. Let's check.

In [None]:
sns.regplot(data=df1,x='weight',y='size')
plt.show()

- There also seems to be a very high correlation between weight and size, which can be substantiated with a correlation matrix

In [None]:
df1.corr(numeric_only=True)   # restrict to numeric columns; the text columns cannot be correlated

- So I am going to drop it. 

In [None]:
df1.drop(columns=['size'],inplace=True)

- Now, looking at the classes of body_type, there seem to be many interesting names.
- That is for marketing purposes. Or perhaps the data analysis department of renttherunway has a definition of these terms.
- As for us, who are trying to feed them into ML algorithms, we cannot be sure about anything except the relationship body_type presumably has with features like 'height', 'weight', 'band_size' and 'cup_size'.
- I can go so far as to claim that body_type is a polynomial combination of the other features. And this is not just intuition. I see it in the 50 samples.
- But, and here is my key argument, these values have been entered by the customers, many of whom were not aware of the terms either.
    - some were in a hurry, so they chose anything.
    - some wanted to show off and lied.
    - some were confused between apple and pear.
- That makes 'body_type' a subjective variable. It has more to do with the customers' self-perception than with reality.
- But their BIOLOGICAL features are a better source of their body_type. They are more OBJECTIVE.
- So, for the sake of objectivity, I am going to drop this feature.


In [None]:
df1.drop(columns=['body_type'],inplace=True)

In [None]:
df1.info()

In [None]:
df2 = df1.copy()   # an independent copy, so encoding and scaling df2 leaves df1 untouched

In [None]:
df1.to_csv('cleaned2_renttherunway.csv')

# 3. Data Preparation for model building

- But before that, we have to
  - encode the categorical features (label encoding is used below), and
  - standardize the data
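The cells that follow use LabelEncoder, which turns each category into an integer code. For nominal features, one-hot encoding is a common alternative, because integer codes imply an order that distance-based clustering will treat as real. A sketch of that alternative on a toy frame (the data and column choice are invented; the scaling uses the population standard deviation, as StandardScaler does):

```python
import pandas as pd

# Toy frame with one numeric and one categorical feature.
toy = pd.DataFrame({
    "weight": [120.0, 137.0, 150.0, 165.0],
    "category": ["dress", "gown", "dress", "others"],
})

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(toy, columns=["category"], dtype=float)

# Standardize every column: subtract the mean, divide by the population std.
scaled = (encoded - encoded.mean()) / encoded.std(ddof=0)

print(scaled.round(3))
```

Whether to standardize the indicator columns as well is a design choice; it is done here only to mirror what StandardScaler would do when applied to the whole frame.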

In [None]:
le = LabelEncoder()

In [None]:
to_be_encoded = df2.select_dtypes(include='object').columns

In [None]:
for feature in to_be_encoded:
  df2[feature] = le.fit_transform(df2[feature])

In [None]:
df2.info()

In [None]:
## Standardization
scaled_features = StandardScaler().fit_transform(df2.values)
df2 = pd.DataFrame(scaled_features, index=df2.index, columns=df2.columns)

In [None]:
df2.head()

In [None]:
df2.corr()

In [None]:
df2.to_csv('prepared_renttherunway.csv')

In [None]:
df1 = pd.read_csv('cleaned1_renttherunway.csv', index_col=False)

In [None]:
df1.info()

### bra_size

- We saw that this feature has too many classes. So clusteing would be a bit difficult for a very complex data.
- Also, bra size is actually is summation of two quantities. One is band size, reprensented by the numeral and other is the bust size, represented by the number.
- So, we can have two features representing this feature. And that's what I am going to do. I will split this feature.
- As in the case of weight, I will extract the numerals and alphabets using a basic regex command 'extract'.

In [None]:
df1["band_size"] = df1["bra_size"].str.extract("([0-9]+)", expand=True).apply(pd.to_numeric)

In [None]:
df1["cup_size"] = df1["bra_size"].str.extract("([a-z]+)", expand=True)

In [None]:
df1.drop(columns=['bra_size','Unnamed: 0'],inplace=True)

### band_size

In [None]:
df1['band_size'].describe()

- There aren't many possble values

In [None]:
df1['band_size'].value_counts()

- band_size ranges from 28 to 48
- We are good to go as far these values are concerned.

### cup_size

In [None]:
df1['cup_size'].value_counts()

- These data points have been filled by the customers. 
- So some are 'ddd' while some are 'f' but both mean the same.
- A shallow dive into the problem from the perspective of business brings to us the fact fact tha
    - dd is e, and ddd is f
    - aa is something that comes before a in terms of size.
- Let's make some replacements

In [None]:
df1['cup_size'].replace({'dd':'e','ddd':'f'}, inplace=True)

- Now, I will label encode. I know that LabelEncoder() encodes data on the basis of alphabetical ordering. So it will take 'a' before 'aa' by default, which will be out of order. 
- So, before label encoding I will rename 'a' as 'ab', so encoding goes exactly as I want
- After label encoding the symbols will be replaced by the cup size (in inches) represented by them

In [None]:
df1['cup_size'].replace({'a':'ab'}, inplace=True)

In [None]:
le = LabelEncoder()
df1['cup_size'] = le.fit_transform(df1['cup_size'])
df1['cup_size'].value_counts()

- From cup_size, we will derive the bust size.
- From https://www.macys.com/p/bra-fit-guide/bra-size-fit-faq/, we know that:
    - bust_size - band_size = cup_size
- We have band and cup size, so we will have bust_size
    - bust_size = band_size + cup_size

In [None]:
df1['bust_size'] = df1['band_size'] + df1['cup_size']

In [None]:
df1['bust_size'].describe()

In [None]:
df1.drop(columns=['cup_size'],inplace=True)

- Now, I think that for a woman her bust and band size must have correlation.
- Let's check.

In [None]:
print(f"The correlation between band and bust size is: ")
print(np.corrcoef(df1['band_size'], df1['bust_size']))

- This is a very high correlation.

In [None]:
sns.regplot(data=df1,x='band_size',y='bust_size')

- A very interesting plot indeed.
- As expected, for one value of band_size, there are multiple values of bust_size.
- To preserve the information contained in both these variables, and do away with correlation, I am going to keep their dot product and save in a new feature called 'chest'.


In [None]:
df1['chest'] = df1['bust_size'] * df1['band_size']
df1.drop(columns=['bust_size','band_size'],inplace=True)

In [None]:
df1.info()

In [None]:
df1['chest'].describe()

### 'fit' and 'rating'

- fit and rating appear to be connected to each other. A customer is expected to rate a product highly if it fits her well. So, from a practical perspective there is definitely a connection.
- let us draw a countplot of rating with fit as hue to visualize the connection

In [None]:
sns.countplot(data=df1,x='rating',hue='fit')
plt.show()

- As is clear from the above graph:
  - customers have rated a product highly when it fit them well;
  - even when the product ran small or large, a lot of customers still rated it highly;
  - the count of people rating a product 1, 2 or 3 is small compared to the count of people rating it 4 or 5;
  - customers respond in the same specific way when the product is small or large, rating it lower than when it fits, to the extent that one can, from a statistical perspective, merge large and small into one subset.
  - Let's verify.

In [None]:
pd.DataFrame(df1.groupby("fit")["rating"].value_counts())

- There is little statistical difference between the rating distributions of customers who found a product small and those who found it large.
- The counts of ratings 1, 2, 3, 4 and 5 are almost the same for the classes small and large, so we can merge them into one class, 'not fit'.
- So I am converting the feature fit into a binary one.
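
Raw counts are hard to compare when the classes differ in size, so a row-normalised table can make the "small looks like large" argument easier to check. This is a minimal sketch on a tiny made-up sample. In the notebook the same call would be applied to `df1['fit']` and `df1['rating']` before `fit` is made binary.

```python
import pandas as pd

# Tiny made-up sample, for illustration only.
toy = pd.DataFrame({
    'fit':    ['small', 'small', 'small', 'small',
               'large', 'large', 'large', 'large', 'fit', 'fit'],
    'rating': [5, 4, 5, 3, 5, 4, 5, 3, 5, 5],
})

# Row-normalised shares: each row sums to 1, so classes of different size are comparable.
shares = pd.crosstab(toy['fit'], toy['rating'], normalize='index')
print(shares)
```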


In [None]:
df1["fit"] = np.where((df1["fit"] == 'fit'), 1, 0)

In [None]:
sns.countplot(data=df1,x='rating',hue='fit')
plt.show()

In [None]:
pd.DataFrame(df1.groupby("rating")["fit"].value_counts())

- The trend continues. If the product fits, customers are going to rate it highly, that is, either 4 or 5.
- The feature 'rating' already carries the necessary information about the customer's choice.
- We do not really need the 'fit' feature in the context of customer segmentation.
- But to be sure, we will perform a Kruskal-Wallis test (as rating is not normally distributed).
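
For reference, this is roughly how the test behaves on two groups of ordinal ratings. The numbers are made up for illustration. Only the call to `scipy.stats.kruskal` mirrors what the notebook does.

```python
from scipy.stats import kruskal

# Made-up ratings for two groups, for illustration only.
ratings_not_fit = [1, 2, 2, 3, 3, 3, 4] * 20
ratings_fit = [3, 4, 4, 5, 5, 5, 5] * 20

# kruskal returns the H statistic and the p-value.
h_stat, p_value = kruskal(ratings_not_fit, ratings_fit)
print(h_stat, p_value)
```

The test is rank-based, so it does not assume normality, which is why it suits a 1 to 5 rating scale.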

In [None]:
from scipy.stats import kruskal

In [None]:
kruskal(df1['rating'][df1['fit'] == 0],
        df1['rating'][df1['fit'] == 1])

- A very high value of the H statistic and a low p-value show that we have sufficient evidence to reject the null hypothesis that the median rating is the same for both groups of fit, YES and NO.
- We can claim that the rating a customer gives differs depending on whether the item fit.
- So the two features, 'rating' and 'fit', are dependent on each other, something we knew intuitively. A person won't rate something highly unless the item fits.
- So, I can get rid of either one of them, or keep both by saving their element-wise product in a new feature.
- Before multiplying, I will increment the fit values by 1, so that rows with fit = 0 do not all collapse to 0.
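
To see why the increment matters, the sketch below enumerates every combination, assuming ratings run from 1 to 5 and fit is binary as described above. The column name `naive` is mine and only marks the version without the increment.

```python
import pandas as pd

# Every (rating, fit) pair, assuming ratings 1..5 and binary fit.
grid = pd.DataFrame([(r, f) for r in range(1, 6) for f in (0, 1)],
                    columns=['rating', 'fit'])

# Without the +1, every fit == 0 row would become 0 and lose its rating.
grid['naive'] = grid['rating'] * grid['fit']
grid['response'] = grid['rating'] * (grid['fit'] + 1)
print(grid)
```

Note that distinct pairs can share a value (rating 4 with fit 0 and rating 2 with fit 1 both give 4), so the product does not keep the two features fully separable.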

In [None]:
df1['response'] = df1['rating'] * (df1['fit']+1)

In [None]:
df1.drop(columns=['rating','fit'],inplace=True)

### rented_for

In [None]:
df1['rented_for'].value_counts()

- There are 9 classes of this feature, which is a rather large number.
- 'party' and 'party: cocktail' clearly belong to one class, 'party'. Let's merge them.

In [None]:
df1.loc[df1.rented_for=='party: cocktail','rented_for'] = 'party'

- If a customer (female, let's not forget) goes on a date and chooses 'expensive' rented clothes, she is going to wear a similar dress to a party. To clarify, I argue that the occasions of party and date can be treated as the same if we focus only on the dress the woman chooses to wear.
- So I am reclassifying all rows with the occasion 'date' into 'party'.

In [None]:
df1.loc[df1.rented_for=='date','rented_for'] = 'party'

- Now, work is basically a formal affair.
- Someone may choose to go to work in a simple tee and jeans, but they won't rent those specially from a company.
- So, as a customer of renttherunway, a woman is renting for a 'formal affair' if she is renting for work.
- So I am reclassifying all the 'work' values as formal affair.
- I am also renaming 'formal affair' as 'formal'.

In [None]:
df1['rented_for'] = df1['rented_for'].replace({'formal affair':'formal'})

In [None]:
df1.loc[df1.rented_for=='work','rented_for'] = 'formal'

In [None]:
df1['rented_for'].value_counts()

In [None]:
df1.loc[df1.rented_for=='vacation','rented_for'] = 'everyday'
df1.loc[df1.rented_for=='other','rented_for'] = 'everyday'

- Now, I have also merged 'everyday', 'other' and 'vacation' into one class.
- The reason has more to do with statistics than real-life reasoning.
- That said, when it comes to dresses, people tend to wear similar, comfortable clothes for everyday use and for vacation.
- The counts for these classes are also low, signifying a smaller cluster. I would like to see them together. I also want to reduce the sparsity of the data.
- The idea is to prepare the data for centroid-based algorithms. If I had decided to go for a density-based algorithm such as DBSCAN, or a neural clustering approach, I might have tackled the dataset in a different way.
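
All of the merges in this subsection can also be written as a single mapping, which makes the final grouping easy to audit. This is a minimal sketch on a made-up sample. `merge_map` is a name I introduce for illustration, and the sample values are the class names mentioned in this notebook.

```python
import pandas as pd

# Made-up sample of occasions using the class names discussed above.
occasions = pd.Series(['party', 'party: cocktail', 'date', 'work',
                       'formal affair', 'vacation', 'other', 'everyday', 'wedding'])

# One mapping for every merge; classes not listed stay unchanged.
merge_map = {
    'party: cocktail': 'party',
    'date': 'party',
    'work': 'formal',
    'formal affair': 'formal',
    'vacation': 'everyday',
    'other': 'everyday',
}
merged = occasions.replace(merge_map)
print(merged.value_counts())
```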

In [None]:
df1['rented_for'].value_counts()

### category

- Let's analyze 'category' feature

In [None]:
df1['category'].value_counts()

- There are 68 classes of item category that customers can rent, assuming every category appears at least once in this huge dataset.
- The counts of individual categories vary widely.
- This is going to create problems when the algorithms cluster the data, more so because we are limiting ourselves to distance-based algorithms like KMeans and Agglomerative.
- A huge percentage of them are 'dress'.

In [None]:
dress_percent = (df1['category'] == 'dress').mean() * 100
print(f'The percentage of dresses rented by customers is {dress_percent:.2f}')

- In fact, most people rent either a dress, a gown or a sheath.
    - dress         92560
    - gown          44160
    - sheath        19227
- In total, 155947 items belong to one of the above three categories, which is around 3/4th of the total count of transactions.
- And it makes sense, as most people come here to rent for occasions like a wedding or a party.
- Now, I will club the rest of the classes into one and call it 'others'.
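
The same idea can be expressed without hard-coding the class names, by keeping the k most frequent categories. This is a sketch on made-up data. The cut-off `k = 3` is an assumption matching the three categories named above.

```python
import pandas as pd

# Made-up category column with a long tail.
category = pd.Series(['dress'] * 6 + ['gown'] * 3 + ['sheath'] * 2
                     + ['romper', 'top', 'skirt'])

# Keep the k most frequent classes; everything else becomes 'others'.
k = 3  # assumed cut-off
top = category.value_counts().nlargest(k).index
collapsed = category.where(category.isin(top), 'others')
print(collapsed.value_counts())
```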

In [None]:
df1['category'] = [x if x in {'dress','gown','sheath'} else 'others' for x in df1['category']]

In [None]:
df1['category'].value_counts()

### body_type and size

- let's first see what are the various values in the column

In [None]:
df1['body_type'].value_counts()

- body_type has the body type of customer
- it is filled by the customer
- And intuitively, it is easy to see that it has strong connections with the customer's weight, height, chest and size of the item they choose to rent.
- Let's look at them a little closely.

In [None]:
df1[['body_type','size','weight','height','chest']].sample(50)

- First of all, 'size' is quite dubious. We have no idea what measure has been used, as discussed earlier. Either this feature is beyond my scope (as my business understanding is limited) or there is something seriously fishy about the values in this column, which is what I strongly believe, based on my reading of the subject.
- But I do see a sort of correlation with weight. Let's check.

In [None]:
sns.regplot(data=df1,x='weight',y='size')
plt.show()

- There also seems to be a very high correlation between weight and size, which can be substantiated with a correlation matrix.

In [None]:
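# numeric_only skips the remaining text columns, which recent pandas versions refuse to correlate
df1.corr(numeric_only=True)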

- Since 'size' is highly correlated with weight, I am going to drop it.

In [None]:
df1.drop(columns=['size'],inplace=True)

- Now, looking at the classes of body_type, there seem to be many interesting names.
- That is for marketing purposes. Or perhaps the data analysis department of renttherunway has a definition of these terms.
- As for us, who are trying to feed them into ML algorithms, we cannot be sure about anything except the relationship body_type presumably has with features like 'height', 'weight', 'band_size' and 'cup_size'.
- I can go so far as to claim that body_type is a polynomial combination of the other features. And this is not just intuition. I see it in the 50 samples.
- But, and here is my key argument, these values have been entered by the customers, many of whom were not aware of the terms either.
    - some were in a hurry, so they chose anything.
    - some wanted to show off and lied.
    - some were confused between apple and pear.
- That makes 'body_type' a subjective variable. It has more to do with customers' self-perception than with reality.
- Their BIOLOGICAL features are a better source of their body type. They are more OBJECTIVE.
- So, for the sake of objectivity, I am going to drop this feature.


In [None]:
df1.drop(columns=['body_type'],inplace=True)

In [None]:
df1.info()

In [None]:
# copy, so that encoding df2 later does not also modify df1
df2 = df1.copy()

In [None]:
df1.to_csv('cleaned2_renttherunway.csv')

In [None]:
df2.sample(15)

# 3. Data Preparation for model building

- But before that, we have to
  - label encode the categorical features, and
  - standardize the data
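
The cells that follow use `LabelEncoder`, which turns each class into an integer. For nominal features, one-hot encoding is a common alternative because it does not imply an order between classes. The sketch below shows that route on a made-up frame. It is an alternative, not what this notebook does, and it would change the number of columns that PCA sees later.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up frame with one numeric and one nominal feature.
toy = pd.DataFrame({
    'age': [25, 32, 41, 29],
    'rented_for': ['party', 'formal', 'wedding', 'party'],
})

# One-hot encoding gives each class its own 0/1 column, so no order is implied.
encoded = pd.get_dummies(toy, columns=['rented_for'], dtype=float)

# Standardize every column to zero mean and unit variance.
scaled = pd.DataFrame(StandardScaler().fit_transform(encoded),
                      index=encoded.index, columns=encoded.columns)
print(scaled.round(2))
```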

In [None]:
le = LabelEncoder()

In [None]:
to_be_encoded = df2.select_dtypes(include='object').columns

In [None]:
for feature in to_be_encoded:
  df2[feature] = le.fit_transform(df2[feature])

In [None]:
df2.info()

In [None]:
## Standardization
scaled_features = StandardScaler().fit_transform(df2.values)
df2 = pd.DataFrame(scaled_features, index=df2.index, columns=df2.columns)

In [None]:
df2.head()

In [None]:
df2.corr()

In [None]:
df2.to_csv('prepared_renttherunway.csv')

The standardized df2 is the input to PCA. I copy it into df3, and the PCA output, df4, is what I will put into the clustering algorithms.

# 4. Principal Component Analysis and Clustering

## Dimensionality Reduction with Principal Component Analysis.

In [None]:
## df3 is the standardized data that PCA works on
df3 = df2.copy()
## Calculating covariance matrix
cov_matrix = np.cov(df3.T)
print('Covariance matrix','\n',cov_matrix)

In [None]:
## Calculating eigen values and eigen vectors
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
print('Eigen vectors:','\n',eig_vecs)
print('\n')
print('Eigen values:','\n',eig_vals)

In [None]:
total = sum(eig_vals)
var_exp = [ (i/total)*100  for i in sorted(eig_vals,reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print('Variance Explained: ',var_exp)
print('Cumulative Variance Explained: ',cum_var_exp)

In [None]:
plt.bar(range(len(var_exp)),var_exp, align='center',color='lightgreen',edgecolor='black',label='Individual Explained Variance')
plt.step(range(len(var_exp)), cum_var_exp, where='mid',color='red',label='Cumulative Explained Variance')
plt.legend(loc = 'best')
plt.ylabel('Explained Variance (%)')
plt.xlabel('Principal Components')
plt.tight_layout()
plt.show()

- We can see that approximately 93.5% of the variance is explained by the first 6 principal components.
- So, we can choose the optimal number of principal components as 6.
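
The manual eigenvalue route used above and scikit-learn's `PCA` give the same explained-variance ratios. The sketch below checks this on synthetic data and picks the number of components from a cumulative threshold. The data and the `target` value are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic correlated data standing in for the standardized features.
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))

# Manual route: eigenvalues of the covariance matrix, largest first.
eig_vals = np.linalg.eigvalsh(np.cov(X.T))[::-1]
manual_ratio = eig_vals / eig_vals.sum()

# scikit-learn route: the same ratios are exposed directly.
pca = PCA(svd_solver='full').fit(X)

# Smallest number of components whose cumulative ratio reaches a target.
target = 0.935  # assumed threshold, taken from the figure quoted above
n_keep = int(np.searchsorted(np.cumsum(manual_ratio), target) + 1)
print(n_keep)
```

`np.linalg.eigvalsh` is used here because a covariance matrix is symmetric, so its eigenvalues are real and come back sorted.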

In [None]:
pca=PCA(n_components = 6)
pca.fit(df3)

In [None]:
#transformed dataset after PCA is df4.
df4 = pca.transform(df3)
df4 = pd.DataFrame(df4,columns=['PC1','PC2','PC3','PC4','PC5','PC6'])
df4.head()

## K-means Clustering

In [None]:
cluster_range = range(1,12)
cluster_errors = []

for num_clusters in cluster_range:
    clusters = KMeans(num_clusters, init='k-means++', n_init=20, random_state=42)
    clusters.fit(df4)
    labels = clusters.labels_
    centroids = clusters.cluster_centers_
    cluster_errors.append(clusters.inertia_)
clusters_df = pd.DataFrame({'num_clusters':cluster_range, 
                           'cluster_errors':cluster_errors})

In [None]:
## Elbow method
plt.figure(figsize=[10,5])
plt.title('The Elbow Method')
plt.xlabel('Number of clusters using PCA')
plt.plot(clusters_df['num_clusters'],clusters_df['cluster_errors'],marker='o',color='b')
plt.show()

- From the elbow plot, we can see that around K = 5 or 6 the inertia stops dropping sharply and the curve flattens.
- We also calculate silhouette scores for several candidate cluster counts. We find that 5 gives the best value, so we will go ahead with 5 clusters.
- The clusters are labeled 0, 1, 2, 3, 4.

In [None]:
## Fit the KMeans clustering model using the obtained optimal K
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=20, random_state=42)
kmeans.fit(df4)

In [None]:
## Creating a new dataframe only for labels and converting it into a categorical variable.
## Reusing df3's index keeps each label aligned with its row in the join below.
df_labels = pd.DataFrame(kmeans.labels_, index=df3.index, columns=['Labels'])
df_labels['Labels'] = df_labels['Labels'].astype('category')
## joining the label dataframe with the standardized dataframe (df3)
df_kmeans = df3.join(df_labels)
df_kmeans.head()

In [None]:
df_kmeans['Labels'].value_counts()
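
`DataFrame.join` matches rows by index, not by position. If the feature frame's index was never reset after rows were dropped, a label frame with a fresh 0..n-1 index will not line up with it. This is a minimal sketch on made-up values, with `labels` standing in for `kmeans.labels_`.

```python
import numpy as np
import pandas as pd

# A frame whose index has gaps, as happens after drop_duplicates().
features = pd.DataFrame({'x': [0.1, 0.2, 0.3]}, index=[0, 2, 5])
labels = np.array([1, 0, 1])  # stand-in for kmeans.labels_

# Default RangeIndex (0, 1, 2): join matches on index, so labels land on the wrong rows.
misaligned = features.join(pd.DataFrame(labels, columns=['Labels']))

# Reusing the feature index keeps each label on its own row.
aligned = features.join(pd.DataFrame(labels, index=features.index, columns=['Labels']))
print(misaligned, aligned, sep='\n\n')
```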

### Silhouette Score for validating the optimal number of clusters.

In [None]:
from sklearn.metrics import silhouette_score

In [None]:
# I study cluster counts from 4 to 8

kmeans_score = []

for i in range(4,9):
    # a separate name keeps the fitted 5-cluster `kmeans` model above intact
    km = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=42)
    labels = km.fit_predict(df4)
    score = silhouette_score(df4,labels)
    kmeans_score.append(score)
    print(i,'   ',score)

- From the above, we can observe that the silhouette score is highest for 5 and 6 clusters, so we can choose the optimal number of clusters as 5 or 6.
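
Computing the silhouette score on every row needs all pairwise distances, which gets expensive on a dataset of this size. `silhouette_score` accepts a `sample_size` argument that scores a random subset instead. The sketch below shows the selection loop on synthetic blobs. The data and the candidate range are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups.
X, _ = make_blobs(n_samples=600, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # sample_size scores a random subset, which keeps the cost manageable on large data.
    scores[k] = silhouette_score(X, labels, sample_size=300, random_state=0)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```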

## Agglomerative Clustering

In [None]:
df4.info()

In [None]:
df4ac = df4.sample(frac=0.50, random_state=42)

In [None]:
df4ac.info()

In [None]:
plt.figure(figsize=[18,7])
merg = linkage(df4ac, method='ward')
dendrogram(merg, leaf_rotation=90,)
plt.xlabel('Datapoints')
plt.ylabel('Euclidean distance')
plt.show()
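
Once the linkage matrix exists, the tree can also be cut into flat clusters with `fcluster`, which is imported at the top of the notebook. This is a minimal sketch on synthetic points. The choice of 3 clusters is an assumption that fits the toy data only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Three synthetic, well-separated groups of 2-D points.
points = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])

merg = linkage(points, method='ward')

# Cut the tree so that at most 3 flat clusters remain (labels start at 1).
flat_labels = fcluster(merg, t=3, criterion='maxclust')
print(np.unique(flat_labels, return_counts=True))
```

This reuses the linkage already computed for the dendrogram, so no second model fit is needed to get labels.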

In [None]:
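## Building hierarchical clustering model using the optimal clusters as 4
## (ward linkage always uses euclidean distance)
hie_cluster = AgglomerativeClustering(n_clusters=4, linkage='ward')
## fitting on df4ac, the sampled PCA data used for the dendrogram above
hie_cluster_model = hie_cluster.fit(df4ac)

In [None]:
## Creating a dataframe of the labels, indexed like the sample they were fitted on
df_label1 = pd.DataFrame(hie_cluster_model.labels_, index=df4ac.index, columns=['Labels'])
df_label1.head(5)

In [None]:
## joining the labels with the matching rows of the unscaled dataframe (df1)
## df4ac.index holds row positions, so select by position and attach the labels in order
df_hier = df1.iloc[df4ac.index].assign(Labels=df_label1['Labels'].values)
df_hier.head()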

### Silhouette Score for validating the optimal number of clusters.

In [None]:
for i in range(2,15):
    hier = AgglomerativeClustering(n_clusters=i)
    labels = hier.fit_predict(df4ac)
    print(i,silhouette_score(df4ac,labels))

- From above, we can observe that the silhouette score is highest for 6.

## Conclusion

- In this case study, we have attempted to cluster the renttherunway dataset using K-means and agglomerative clustering, and we also reduced the dimensionality of the dataset using PCA.
- We came up with 5 clusters using K-means and 4 clusters using agglomerative clustering.
- Although the choice of cluster count can be refined using the silhouette score, for a general introductory study it is okay to read the plot (either the elbow graph or the dendrogram) and settle on a particular number of clusters.
- Further, we can analyze the clusters through bivariate analysis between the cluster labels and the different features, to understand the characteristics of the different groups.
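
A first step toward that cluster analysis could be a per-cluster profile table. This is a minimal sketch on made-up numbers. The column names are placeholders, and in the notebook the input would be a labelled frame such as `df_hier`.

```python
import pandas as pd

# Made-up clustered data, for illustration only.
clustered = pd.DataFrame({
    'age':    [24, 26, 45, 47, 33, 35],
    'weight': [120, 125, 150, 155, 135, 140],
    'Labels': [0, 0, 1, 1, 2, 2],
})

# Mean of each feature per cluster, plus the cluster size.
profile = clustered.groupby('Labels').mean()
profile['n_rows'] = clustered.groupby('Labels').size()
print(profile)
```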