### Import the required libraries and load the data

#### 1. Load the required libraries and read the dataset

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
df = pd.read_csv(r'renttherunway.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,fit,user_id,bust size,item_id,weight,rating,rented for,review_text,body type,review_summary,category,height,size,age,review_date
0,0,fit,420272,34d,2260466,137lbs,10.0,vacation,An adorable romper! Belt and zipper were a lit...,hourglass,So many compliments!,romper,"5' 8""",14,28.0,"April 20, 2016"
1,1,fit,273551,34b,153475,132lbs,10.0,other,I rented this dress for a photo shoot. The the...,straight & narrow,I felt so glamourous!!!,gown,"5' 6""",12,36.0,"June 18, 2013"
2,2,fit,360448,,1063761,,10.0,party,This hugged in all the right places! It was a ...,,It was a great time to celebrate the (almost) ...,sheath,"5' 4""",4,116.0,"December 14, 2015"
3,3,fit,909926,34c,126335,135lbs,8.0,formal affair,I rented this for my company's black tie award...,pear,Dress arrived on time and in perfect condition.,dress,"5' 5""",8,34.0,"February 12, 2014"
4,4,fit,151944,34b,616682,145lbs,10.0,wedding,I have always been petite in my upper body and...,athletic,Was in love with this dress !!!,gown,"5' 9""",12,27.0,"September 26, 2016"


#### 2. Check the first few samples, the shape, and the info of the data, and familiarize yourself with the different features

In [5]:
df.shape

(192544, 16)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192544 entries, 0 to 192543
Data columns (total 16 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Unnamed: 0      192544 non-null  int64  
 1   fit             192544 non-null  object 
 2   user_id         192544 non-null  int64  
 3   bust size       174133 non-null  object 
 4   item_id         192544 non-null  int64  
 5   weight          162562 non-null  object 
 6   rating          192462 non-null  float64
 7   rented for      192534 non-null  object 
 8   review_text     192482 non-null  object 
 9   body type       177907 non-null  object 
 10  review_summary  192199 non-null  object 
 11  category        192544 non-null  object 
 12  height          191867 non-null  object 
 13  size            192544 non-null  int64  
 14  age             191584 non-null  float64
 15  review_date     192544 non-null  object 
dtypes: float64(2), int64(4), object(10)
memory usage: 23.5+ MB

### Data cleansing and Exploratory data analysis:

#### 3. Check whether there are any duplicate records in the dataset. If there are, drop them.

In [7]:
df[df.duplicated()]

Unnamed: 0.1,Unnamed: 0,fit,user_id,bust size,item_id,weight,rating,rented for,review_text,body type,review_summary,category,height,size,age,review_date


In [None]:
# As per above, there are no duplicates
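
The same check can be expressed as a count, which is easier to read than an empty frame. The sketch below is a minimal illustration on a made-up three-row frame (the values are hypothetical, only the column names come from the dataset); it counts fully duplicated rows and drops them.

```python
import pandas as pd

# Hypothetical mini-frame: the second row repeats the first one exactly
sample = pd.DataFrame({
    "user_id": [420272, 420272, 273551],
    "item_id": [2260466, 2260466, 153475],
    "rating": [10.0, 10.0, 10.0],
})

# duplicated() flags every repeat of an earlier row, so the sum is the number of rows to drop
n_duplicates = int(sample.duplicated().sum())
print(n_duplicates)

# drop_duplicates() keeps the first occurrence of each row by default
deduped = sample.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```

On the real data, `df.duplicated().sum()` returning 0 would confirm the conclusion above.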

#### 4. Drop the columns that you think are redundant for the analysis. (Hint: drop columns like 'id', 'review')

In [11]:
# Drop the leftover index columns whose names start with "Unnamed"
df.drop(columns=df.filter(regex="^Unnamed").columns, inplace=True)

In [13]:
# Drop the review-related columns (every column whose name starts with "review_")
df.drop(columns=df.filter(regex="^review_").columns, inplace=True)

In [14]:
df.head()

Unnamed: 0,fit,user_id,bust size,item_id,weight,rating,rented for,body type,category,height,size,age
0,fit,420272,34d,2260466,137lbs,10.0,vacation,hourglass,romper,"5' 8""",14,28.0
1,fit,273551,34b,153475,132lbs,10.0,other,straight & narrow,gown,"5' 6""",12,36.0
2,fit,360448,,1063761,,10.0,party,,sheath,"5' 4""",4,116.0
3,fit,909926,34c,126335,135lbs,8.0,formal affair,pear,dress,"5' 5""",8,34.0
4,fit,151944,34b,616682,145lbs,10.0,wedding,athletic,gown,"5' 9""",12,27.0


#### 5. Check the column 'weight'. Does it contain string data? If yes, remove the string part and convert the column to float. (Hint: 'weight' has the suffix 'lbs')

In [17]:
# Strip the literal 'lbs' suffix so that only the number remains
df['weight'] = df['weight'].str.replace('lbs', '', regex=False)

In [19]:
# Replace missing weights with 0 (a placeholder value, not a real weight)
df['weight'] = df['weight'].fillna(0)

In [27]:
df['weight'] = df['weight'].astype(float)

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192544 entries, 0 to 192543
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   fit         192544 non-null  object 
 1   user_id     192544 non-null  int64  
 2   bust size   174133 non-null  object 
 3   item_id     192544 non-null  int64  
 4   weight      192544 non-null  float64
 5   rating      192462 non-null  float64
 6   rented for  192534 non-null  object 
 7   body type   177907 non-null  object 
 8   category    192544 non-null  object 
 9   height      191867 non-null  object 
 10  size        192544 non-null  int64  
 11  age         191584 non-null  float64
dtypes: float64(3), int64(3), object(6)
memory usage: 17.6+ MB
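
Filling the gaps with 0 makes `weight` look complete in `df.info()`, but the zeros then count as real values in any mean or median. As an alternative, here is a minimal sketch that keeps missing weights as NaN so they can be imputed later. The sample values are taken from the `df.head()` output above; the name `weights` is only illustrative.

```python
import pandas as pd

# Weights from the first rows shown above, including one missing entry
weights = pd.Series(["137lbs", "132lbs", None, "135lbs"], name="weight")

# Remove the literal suffix, then parse to float; missing entries stay NaN
weights = pd.to_numeric(weights.str.replace("lbs", "", regex=False), errors="coerce")

print(weights)
print(weights.isna().sum())  # number of weights still missing
```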


#### 6. Check the unique categories for the column 'rented for' and group the 'party: cocktail' category with 'party'.

In [29]:
df['rented for'].unique()

array(['vacation', 'other', 'party', 'formal affair', 'wedding', 'date',
       'everyday', 'work', nan, 'party: cocktail'], dtype=object)

In [31]:
df['rented for'] = df['rented for'].replace('party: cocktail','party')

In [32]:
df['rented for'].unique()

array(['vacation', 'other', 'party', 'formal affair', 'wedding', 'date',
       'everyday', 'work', nan], dtype=object)

#### 7. The column 'height' is given in feet and inches with quotation marks (for example 5' 8"). Convert it to inches with float datatype.
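
One possible approach, sketched under the assumption that every non-missing height follows the `5' 8"` pattern seen in the `df.head()` output: parse feet and inches with a regular expression and return feet × 12 + inches. The helper name `height_to_inches` is hypothetical, and anything that does not match the pattern is returned as NaN rather than guessed.

```python
import re
import pandas as pd

def height_to_inches(value):
    """Convert a string such as 5' 8" to total inches; return NaN if it does not parse."""
    if not isinstance(value, str):
        return float("nan")
    match = re.match(r"\s*(\d+)'\s*(\d+)\"", value)
    if match is None:
        return float("nan")
    feet, inches = match.groups()
    return float(int(feet) * 12 + int(inches))

# Heights from the first rows shown above, plus one missing value
sample = pd.DataFrame({"height": ["5' 8\"", "5' 6\"", "5' 4\"", None]})
sample["height"] = sample["height"].map(height_to_inches)
print(sample)
```

On the full dataset the same call would be `df['height'] = df['height'].map(height_to_inches)`.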

#### 8. Check for missing values in each column of the dataset. If there are any, impute them with appropriate methods.
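
A minimal sketch of one common strategy, not the only appropriate one: count missing values per column, then fill numeric columns with the median and categorical columns with the mode. The mini-frame below is hypothetical and only reuses column names from the dataset. Note that 'weight' was filled with 0 in step 5, so on the real data those zeros would first have to be turned back into NaN (for example with `df['weight'].replace(0, np.nan)`) before a median is meaningful.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with one gap per column
sample = pd.DataFrame({
    "rating": [10.0, 8.0, np.nan, 10.0],
    "age": [28.0, 36.0, 34.0, np.nan],
    "body type": ["hourglass", None, "pear", "hourglass"],
})

# Missing values per column
print(sample.isnull().sum())

for col in sample.columns:
    if pd.api.types.is_numeric_dtype(sample[col]):
        # The median is less sensitive to extreme values than the mean
        sample[col] = sample[col].fillna(sample[col].median())
    else:
        # The mode is the most frequent category
        sample[col] = sample[col].fillna(sample[col].mode()[0])

print(sample)
```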

#### 9. Check the statistical summary for the numerical and categorical columns and write your findings.
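
As a sketch of how the two summaries can be produced, `describe()` covers the numeric columns by default, and excluding numeric dtypes gives count, unique, top, and freq for the categorical ones. The sample frame is hypothetical and reuses only column names from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with one numeric and one categorical column
sample = pd.DataFrame({
    "rating": [10.0, 8.0, 10.0, 10.0],
    "rented for": ["vacation", "party", "party", "wedding"],
})

# count, mean, std, min, quartiles, max for numeric columns
numeric_summary = sample.describe()

# count, unique, top (most frequent value), freq for non-numeric columns
categorical_summary = sample.describe(exclude=[np.number])

print(numeric_summary)
print(categorical_summary)
```

On the real data, the numeric summary is a quick way to spot implausible values such as the age of 116 visible in the first rows.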

#### 10. Are there outliers present in the column age? If yes, treat them with the appropriate method.
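
One widely used treatment is the interquartile range (IQR) rule: values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are flagged and capped at those bounds. The sketch below applies it to the five ages from the `df.head()` output, which include 116.0. Capping is an assumption made here; dropping the rows or using domain limits are equally valid choices.

```python
import pandas as pd

# Ages from the first five rows shown above
sample = pd.DataFrame({"age": [28.0, 36.0, 116.0, 34.0, 27.0]})

q1 = sample["age"].quantile(0.25)
q3 = sample["age"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Count values outside the whiskers before treating them
n_outliers = int(((sample["age"] < lower) | (sample["age"] > upper)).sum())
print(lower, upper, n_outliers)

# Cap (winsorize) the outliers at the bounds instead of dropping the rows
sample["age"] = sample["age"].clip(lower=lower, upper=upper)
print(sample)
```

A box plot such as `sns.boxplot(x=df['age'])` is a quick visual check of the same rule.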

#### 11. Check the distribution of the different categories in the column 'rented for' using an appropriate plot.
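
A bar chart of category counts is a natural fit for a single categorical column. The sketch below uses a hypothetical handful of 'rented for' values and plain matplotlib; `sns.countplot` would give an equivalent picture.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sample of 'rented for' values (category names taken from the dataset)
sample = pd.DataFrame({
    "rented for": ["vacation", "other", "party", "formal affair",
                   "wedding", "party", "party"],
})

# value_counts() sorts by frequency and ignores NaN by default
counts = sample["rented for"].value_counts()

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(list(counts.index), counts.values)
ax.set_xlabel("rented for")
ax.set_ylabel("number of rentals")
ax.set_title("Distribution of 'rented for'")
fig.tight_layout()
```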

### Data Preparation for model building:

#### 12. Encode the categorical variables in the dataset.
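
A minimal sketch of label encoding with pandas alone, assuming integer codes are acceptable for the model that follows; one-hot encoding with `pd.get_dummies` is the usual alternative when the categories have no natural order. The sample rows reuse values from the `df.head()` output.

```python
import pandas as pd

# Values from the first rows shown above
sample = pd.DataFrame({
    "body type": ["hourglass", "straight & narrow", "pear", "athletic", "hourglass"],
    "category": ["romper", "gown", "sheath", "dress", "gown"],
    "size": [14, 12, 4, 8, 12],
})

# Encode every non-numeric column; numeric columns are left untouched
categorical_cols = sample.select_dtypes(exclude="number").columns

for col in categorical_cols:
    # Categories are sorted alphabetically and numbered from 0; a missing value would get -1
    sample[col] = sample[col].astype("category").cat.codes

print(sample)
```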

#### 13. Standardize the data so that all features are on a comparable scale.
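
Standardization (z-scoring) rescales each column to mean 0 and standard deviation 1; it does not confine values to a fixed interval, which is what min-max scaling does. Below is a minimal pandas sketch on hypothetical numeric values. It uses the population standard deviation (`ddof=0`), which to my knowledge matches what scikit-learn's `StandardScaler` computes.

```python
import pandas as pd

# Hypothetical numeric features (column names taken from the dataset)
sample = pd.DataFrame({
    "size": [14.0, 12.0, 4.0, 8.0, 12.0],
    "age": [28.0, 36.0, 34.0, 27.0, 30.0],
})

# z = (x - mean) / std, computed column by column
standardized = (sample - sample.mean()) / sample.std(ddof=0)

print(standardized.round(3))
print(standardized.mean().round(6))       # approximately 0 for every column
print(standardized.std(ddof=0).round(6))  # 1 for every column
```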