# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [72]:
# Import pandas and any other libraries you need here.
import pandas as pd
import numpy as np

df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')
print(df)

       Unnamed: 0  Clothing ID  Age  \
0               0          767   33   
1               1         1080   34   
2               2         1077   60   
3               3         1049   50   
4               4          847   47   
...           ...          ...  ...   
23481       23481         1104   34   
23482       23482          862   48   
23483       23483         1104   31   
23484       23484         1084   28   
23485       23485         1104   52   

                                                   Title  \
0                                                    NaN   
1                                                    NaN   
2                                Some major design flaws   
3                                       My favorite buy!   
4                                       Flattering shirt   
...                                                  ...   
23481                     Great dress for many occasions   
23482                         Wish it was made of c

In [110]:
# Print out any information you need to understand your dataframe
print(df.columns)
print(df.shape)  # (rows, columns): 23486 rows, 11 columns
print(df.head(5))

Index(['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating',
       'Recommended IND', 'Positive Feedback Count', 'Division Name',
       'Department Name', 'Class Name'],
      dtype='object')
(23486, 11)
   Unnamed: 0  Clothing ID  Age                    Title  \
0           0          767   33                      NaN   
1           1         1080   34                      NaN   
2           2         1077   60  Some major design flaws   
3           3         1049   50         My favorite buy!   
4           4          847   47         Flattering shirt   

                                         Review Text  Rating  Recommended IND  \
0  Absolutely wonderful - silky and sexy and comf...       4                1   
1  Love this dress!  it's sooo pretty.  i happene...       5                1   
2  I had such high hopes for this dress and reall...       3                0   
3  I love, love, love this jumpsuit. it's fun, fl...       5                1   
4  This shir
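The `Unnamed: 0` column in the output above is what pandas produces when a CSV's first column has no header, which usually means the row index was saved along with the data. The sketch below shows one way to handle that at load time with `index_col=0`. It uses a small made-up CSV string (`csv_text` is hypothetical, not the real file) so that it runs on its own.

```python
import io

import pandas as pd

# Hypothetical three-row CSV whose first column has no header,
# mimicking how the reviews file appears to have been saved.
csv_text = ",Clothing ID,Age,Rating\n0,767,33,4\n1,1080,34,5\n2,1077,60,3\n"

# Default load: the unnamed first column becomes "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 tells pandas to use that first column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())

# info() summarizes dtypes and non-null counts in one call.
indexed.info()
```

Whether to load the real file this way is a judgment call; keeping the column and dropping it later works too.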

## Missing Data

Try out different methods to locate and resolve missing data.

In [298]:
# Try to find some missing data!

# There seems to be a lot of missing data in "Title", "Review Text",
# "Division Name", "Department Name", and "Class Name".
# Note: once the fills below have run, re-running this cell reports 0 for every column.
print("Missing Values Count:")
print(df.isnull().sum())
print(df)

# Replace missing text and category values with explicit placeholder labels.
df["Title"] = df["Title"].fillna("Not Provided")
df["Review Text"] = df["Review Text"].fillna("No Review Provided")
df["Division Name"] = df["Division Name"].fillna("Not Specified")
df["Department Name"] = df["Department Name"].fillna("Not Specified")
df["Class Name"] = df["Class Name"].fillna("Not Specified")
print(df["Class Name"].isnull().sum())

Missing Values Count:
Unnamed: 0                 0
Clothing ID                0
Age                        0
Title                      0
Review Text                0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64
       Unnamed: 0  Clothing ID  Age  \
0               0          767   33   
1               1         1080   34   
2               2         1077   60   
3               3         1049   50   
4               4          847   47   
...           ...          ...  ...   
23481       23481         1104   34   
23482       23482          862   48   
23483       23483         1104   31   
23484       23484         1084   28   
23485       23485         1104   52   

                                                   Title  \
0                                           Not Provided   
1                                           Not Provided 

Did you find any missing data? What things worked well for you and what did not?

In [None]:
# Respond to the above questions here:
# I did find missing data. I was able to use .fillna() to fill in all missing values in the dataset.
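One thing that might help here is recording the missing counts before filling, so the numbers are not lost when the cell is re-run. Below is a minimal sketch on a made-up frame (`toy` and its values are hypothetical). It also passes a dict to `.fillna()` to fill several columns in one call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the reviews dataset.
toy = pd.DataFrame({
    "Title": ["Great dress", np.nan, "Too small", np.nan],
    "Review Text": ["Love it", "Runs large", np.nan, "Fine"],
    "Rating": [5, 4, 2, 3],
})

# Count missing values per column BEFORE filling anything.
missing_counts = toy.isnull().sum()
# The mean of a boolean mask is the fraction of rows that are missing.
missing_share = toy.isnull().mean()
print(missing_counts)
print(missing_share)

# A dict maps each column to its own placeholder value.
fill_values = {"Title": "Not Provided", "Review Text": "No Review Provided"}
filled = toy.fillna(value=fill_values)

# Confirm that nothing is missing afterwards.
remaining = int(filled.isnull().sum().sum())
print(remaining)
```

Assigning the result to a new name (`filled`) leaves the original frame untouched, so the before and after counts can still be compared.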

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [313]:
# Keep an eye out for outliers!
print(df.describe())


         Unnamed: 0   Clothing ID           Age        Rating  \
count  23486.000000  23486.000000  23486.000000  23486.000000   
mean   11742.500000    918.118709     43.198544      4.196032   
std     6779.968547    203.298980     12.279544      1.110031   
min        0.000000      0.000000     18.000000      1.000000   
25%     5871.250000    861.000000     34.000000      4.000000   
50%    11742.500000    936.000000     41.000000      5.000000   
75%    17613.750000   1078.000000     52.000000      5.000000   
max    23485.000000   1205.000000     99.000000      5.000000   

       Recommended IND  Positive Feedback Count  
count     23486.000000             23486.000000  
mean          0.822362                 2.535936  
std           0.382216                 5.702202  
min           0.000000                 0.000000  
25%           1.000000                 0.000000  
50%           1.000000                 1.000000  
75%           1.000000                 3.000000  
max           

What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [None]:
# Make your notes here: I used df.describe() to look at the min, max, and quartiles of each numeric column, but I am uncertain what outliers there may be. I don't see any outliers in the ratings, and the age range (18 to 99) doesn't look particularly suspicious either. None of the data I am seeing looks like it should be thrown out; it may be pertinent information for our analyses.
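If `describe()` alone leaves it unclear whether anything counts as an outlier, one common rule of thumb is the 1.5 × IQR fence. The sketch below applies it to a handful of made-up ages (the `ages` values are hypothetical, not taken from the dataset). A flagged value is only a candidate for a closer look, not necessarily something to delete.

```python
import pandas as pd

# Hypothetical sample of ages; the real column has 23,486 values.
ages = pd.Series([33, 34, 60, 50, 47, 41, 39, 99], name="Age")

# Interquartile range: the spread of the middle 50% of the data.
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences.
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)
```

The 1.5 multiplier is a convention rather than a law; a wider fence such as 3 × IQR flags fewer values.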

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [325]:
# Look out for unnecessary data!
# errors="ignore" lets this cell be re-run without a KeyError once the columns are gone.
df.drop(columns=["Unnamed: 0", "Positive Feedback Count"], inplace=True, errors="ignore")

Did you find any unnecessary data in your dataset? How did you handle it?

In [None]:
# Make your notes here.
# I looked over the overall dataset and tried to figure out if there were redundancies or information that is just not pertinent to this analysis. I decided to drop the "Unnamed: 0" column as well as "Positive Feedback Count". I am not sure if I would drop one of the Class or Department names as well. I went ahead and used the .drop() function. I liked using columns= instead of axis=1; it seems cleaner to me.
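Before dropping a column, it can help to confirm that it really is redundant. The sketch below runs two such checks on a made-up frame: whether a column simply repeats the row index, and whether one column is an exact copy of another. The frame `toy` and its `Item ID` column are hypothetical and exist only to illustrate the duplicate check.

```python
import pandas as pd

# Hypothetical frame; "Item ID" is an invented exact copy of "Clothing ID".
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Clothing ID": [767, 1080, 1077],
    "Item ID": [767, 1080, 1077],
    "Rating": [4, 5, 3],
})

# Check 1: does the column just repeat the row index?
is_index_copy = bool((toy["Unnamed: 0"].to_numpy() == toy.index.to_numpy()).all())
print(is_index_copy)

# Check 2: transpose so columns become rows, then flag repeated rows.
dup_mask = toy.T.duplicated()
dup_cols = dup_mask[dup_mask].index.tolist()
print(dup_cols)

# Drop the index copy and any duplicate columns in one call.
trimmed = toy.drop(columns=["Unnamed: 0"] + dup_cols)
print(trimmed.columns.tolist())
```

Transposing a large frame can be slow, so on the full dataset it may be more practical to compare specific pairs of columns with `.equals()`.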

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.

In [333]:
# Look out for inconsistent data!
print(df["Class Name"].unique())

['Intimates' 'Dresses' 'Pants' 'Blouses' 'Knits' 'Outerwear' 'Lounge'
 'Sweaters' 'Skirts' 'Fine gauge' 'Sleep' 'Jackets' 'Swim' 'Trend' 'Jeans'
 'Legwear' 'Shorts' 'Layering' 'Casual bottoms' 'Not Specified' 'Chemises']


Did you find any inconsistent data? What did you do to clean it?

In [None]:
# Make your notes here!
# I have not been able to find any inconsistencies in the dataset, so I am not sure what I would change. I thought that maybe there could be inconsistencies between the overall rating and the recommendation indicator?
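Both ideas in the note above can be checked in a few lines. The sketch below uses a made-up frame (`toy` and its values are hypothetical) to normalize stray whitespace and capitalization in a label column, then cross-tabulates `Rating` against `Recommended IND`. The rule for what counts as a disagreement (a rating of 2 or lower that is recommended, or 4 or higher that is not) is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical rows with deliberately messy labels and mixed signals.
toy = pd.DataFrame({
    "Class Name": ["Dresses", " dresses", "Knits", "KNITS ", "Pants"],
    "Rating": [5, 1, 4, 2, 5],
    "Recommended IND": [1, 1, 1, 0, 0],
})

# Strip surrounding whitespace and standardize capitalization.
toy["Class Name"] = toy["Class Name"].str.strip().str.capitalize()
classes = sorted(toy["Class Name"].unique())
print(classes)

# Count how often each rating appears with each recommendation flag.
table = pd.crosstab(toy["Rating"], toy["Recommended IND"])
print(table)

# Assumed rule: low rating but recommended, or high rating but not recommended.
low_but_recommended = (toy["Rating"] <= 2) & (toy["Recommended IND"] == 1)
high_but_not = (toy["Rating"] >= 4) & (toy["Recommended IND"] == 0)
suspicious = toy[low_but_recommended | high_but_not]
print(suspicious)
```

Rows flagged this way are not necessarily errors, since a reviewer can dislike an item and still recommend it, but they are a reasonable place to start reading the review text.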