# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [99]:
# Import pandas and any other libraries you need here.
import pandas as pd
import matplotlib as mpl
import numpy as np

# Create a new dataframe from your CSV
file_path = 'Womens Clothing E-Commerce Reviews.csv'
df = pd.read_csv(file_path)

In [100]:
# Print out any information you need to understand your dataframe
# df.shape
# df.info()
# df.describe()
df.columns

Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
       'Recommended IND', 'Positive Feedback Count', 'Division Name',
       'Department Name', 'Class Name'],
      dtype='object')
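
A minimal sketch of the inspection calls mentioned above, run on a small made-up frame (the `toy` values are invented, not taken from the reviews dataset). It shows that `shape` is an attribute, while `info()` and `describe()` are methods and need parentheses to run.

```python
import pandas as pd

# Toy frame with invented values, standing in for the reviews data.
toy = pd.DataFrame({"Age": [30, 40, 50], "Rating": [5, 4, 3]})

print(toy.shape)          # attribute: (rows, columns), no parentheses
toy.info()                # method: prints dtypes and non-null counts
summary = toy.describe()  # method: returns summary statistics as a DataFrame
print(summary)
```

Without the parentheses, `toy.info` only displays the bound method instead of calling it.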

## Missing Data

Try out different methods to locate and resolve missing data.

In [95]:
# Try to find some missing data!
# Fill missing free-text fields with descriptive placeholders.
df = df.fillna(value={"Title": "Untitled Review", "Review Text": "Left Blank"})
# Drop the few rows that are missing any of the category columns.
df = df.dropna(subset=['Division Name', 'Class Name', 'Department Name'], how='any')
# df.isna().sum()
print(df)



       Unnamed: 0  Clothing ID  Age  \
0               0          767   33   
1               1         1080   34   
2               2         1077   60   
3               3         1049   50   
4               4          847   47   
...           ...          ...  ...   
23481       23481         1104   34   
23482       23482          862   48   
23483       23483         1104   31   
23484       23484         1084   28   
23485       23485         1104   52   

                                                   Title  \
0                                        Untitled Review   
1                                        Untitled Review   
2                                Some major design flaws   
3                                       My favorite buy!   
4                                       Flattering shirt   
...                                                  ...   
23481                     Great dress for many occasions   
23482                         Wish it was made of c

Did you find any missing data? What things worked well for you and what did not?

In [None]:
# Respond to the above questions here:
# Some of the titles were missing, so I replaced them with "Untitled Review" to be more descriptive.
# Dropping completely blank rows with df.dropna(how="all") did not change anything
# because none of the rows were completely blank, which was good.
# I did drop the rows that were missing the division name, etc., because losing 14 rows did not seem too impactful.
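
One way to check the effect of these steps is to count missing values per column before and after cleaning. The sketch below uses a small made-up frame (`toy` and its values are invented for illustration), so the counts are not those of the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the reviews data; the values are invented.
toy = pd.DataFrame({
    "Title": ["Great fit", np.nan, "Too small", np.nan],
    "Review Text": ["Love it", "Runs large", np.nan, "Okay"],
    "Division Name": ["General", "General", np.nan, "Intimates"],
})

# Count missing values per column before deciding how to treat each one.
missing_before = toy.isna().sum()

# Fill the free-text columns with placeholders in a single call...
cleaned = toy.fillna({"Title": "Untitled Review", "Review Text": "Left Blank"})
# ...and drop rows that are missing the categorical column.
cleaned = cleaned.dropna(subset=["Division Name"])

missing_after = cleaned.isna().sum()
print(missing_before)
print(missing_after)
```

Comparing the two counts makes it easy to confirm that every gap was either filled or dropped, and how many rows the drop cost.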

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [112]:
# Keep an eye out for outliers!

lower_quartile_age = df['Age'].quantile(0.25)
upper_quartile_age = df['Age'].quantile(0.75)
IQR = upper_quartile_age - lower_quartile_age
lowest_age = lower_quartile_age - 1.5 * IQR
highest_age = upper_quartile_age + 1.5* IQR
age_outlier = (df['Age'] < lowest_age) | (df['Age'] > highest_age)
df = df[~age_outlier]
df.describe()


         Unnamed: 0  Clothing ID        Age    Rating  Recommended IND  Positive Feedback Count
count       23329.0      23329.0    23329.0   23329.0          23329.0                  23329.0
mean   11741.492006   918.192293  42.934031  4.194479         0.822067                 2.536757
std     6777.742179   203.142638  11.882645  1.110004         0.382465                 5.713438
min             0.0          0.0       18.0       1.0              0.0                      0.0
25%          5870.0        861.0       34.0       4.0              1.0                      0.0
50%         11744.0        936.0       41.0       5.0              1.0                      1.0
75%         17610.0       1078.0       51.0       5.0              1.0                      3.0
max         23485.0       1205.0       76.0       5.0              1.0                    122.0


What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [None]:
# Make your notes here:
# Using the IQR to identify outliers in the Age column was easy. Some patrons could have been 99 years old, but
# eliminating them shouldn't affect the data too much.
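
The IQR rule used above can be wrapped in a small helper so the same fences can be reused on other numeric columns. This is only a sketch: `iqr_bounds` is a hypothetical helper name, and the ages are invented for illustration.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Invented ages with one extreme value.
ages = pd.Series([30, 32, 34, 36, 38, 40, 42, 44, 99], name="Age")
low, high = iqr_bounds(ages)

# Values outside the fences are flagged as outliers.
outliers = ages[(ages < low) | (ages > high)]
print(low, high)          # 22.0 54.0
print(outliers.tolist())  # [99]
```

Inspecting the flagged values before filtering them out helps confirm they are implausible rather than just rare.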

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [119]:
# Look out for unnecessary data!

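# Drop the repeated index column and the two category columns judged redundant.
df = df.drop(columns=['Unnamed: 0', 'Division Name', 'Class Name'])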
df.columns

Index(['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
       'Recommended IND', 'Positive Feedback Count', 'Department Name'],
      dtype='object')

Did you find any unnecessary data in your dataset? How did you handle it?

In [None]:
# Make your notes here.
# Here I believe that the Division Name and Class Name columns are redundant with the Department Name column, and the Unnamed: 0 column seems to be a repeat of the index.
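
Before dropping columns as redundant, it can help to test the assumption. The sketch below, on a made-up frame (`toy` and its values are invented), checks whether a column merely repeats the index and whether one category column adds detail beyond another.

```python
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Department Name": ["Tops", "Tops", "Dresses", "Bottoms"],
    "Class Name": ["Knits", "Blouses", "Dresses", "Pants"],
})

# A column that equals the row index carries no extra information.
is_repeat_index = (toy["Unnamed: 0"] == toy.index).all()

# Count the distinct classes within each department. Any count above 1
# means Class Name holds detail that Department Name alone does not.
classes_per_department = toy.groupby("Department Name")["Class Name"].nunique()

print(is_repeat_index)
print(classes_per_department)
```

If some departments contain several classes, dropping the class column loses information, so whether it counts as unnecessary depends on the analysis planned.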

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.

In [127]:
# Look out for inconsistent data!

# Convert the 0/1 flag to a boolean in place (same column name, so no extra column is created).
df['Recommended IND'] = df['Recommended IND'].astype('bool')
df.dtypes

Clothing ID                 int64
Age                         int64
Title                      object
Review Text                object
Rating                      int64
Recommended IND              bool
Positive Feedback Count     int64
Department Name            object
dtype: object

Did you find any inconsistent data? What did you do to clean it?

In [None]:
# Make your notes here!
# Here I changed the datatype of Recommended IND from a binary integer to a boolean to make it easier to read and hopefully to work with.
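
A cast to boolean silently turns every nonzero value into `True`, so it can be worth confirming that the column really holds only 0 and 1 first. A minimal sketch on invented values (`flags` is a made-up series, not the real column):

```python
import pandas as pd

# Invented 0/1 flags standing in for the recommendation indicator.
flags = pd.Series([1, 0, 1, 1], name="Recommended IND")

# Only cast if every value is 0 or 1; otherwise leave the data untouched.
only_binary = flags.isin([0, 1]).all()
as_bool = flags.astype(bool) if only_binary else flags

print(as_bool.dtype)
print(as_bool.tolist())
```

Assigning the result back under the same column name replaces the integer column rather than adding a second one.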