# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [11]:
# Import pandas and any other libraries you need here.
import pandas as pd


# Create a new dataframe from your CSV
df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")
print(df.head())

   Unnamed: 0  Clothing ID  Age                    Title  \
0           0          767   33                      NaN   
1           1         1080   34                      NaN   
2           2         1077   60  Some major design flaws   
3           3         1049   50         My favorite buy!   
4           4          847   47         Flattering shirt   

                                         Review Text  Rating  Recommended IND  \
0  Absolutely wonderful - silky and sexy and comf...       4                1   
1  Love this dress!  it's sooo pretty.  i happene...       5                1   
2  I had such high hopes for this dress and reall...       3                0   
3  I love, love, love this jumpsuit. it's fun, fl...       5                1   
4  This shirt is very flattering to all due to th...       5                1   

   Positive Feedback Count   Division Name Department Name Class Name  
0                        0       Initmates        Intimate  Intimates  
1       

In [12]:
# Print out any information you need to understand your dataframe
df.info()  # info() prints its own summary and returns None, so wrapping it in print() only adds a stray "None"


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


## Missing Data

Try out different methods to locate and resolve missing data.
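
As a rough sketch of the "locate and resolve" steps, the example below counts missing values per column and then shows two common ways to handle them. The small `sample` dataframe and its values are hypothetical stand-ins for the reviews data, not rows taken from the CSV.

```python
import pandas as pd

# Hypothetical miniature stand-in for the reviews dataframe.
sample = pd.DataFrame({
    "Title": [None, "My favorite buy!", "Flattering shirt"],
    "Review Text": ["Love this dress!", None, "Very flattering."],
    "Rating": [5, 5, 4],
})

# Count missing values per column instead of printing the full boolean mask.
missing_counts = sample.isna().sum()
print(missing_counts)

# Option 1: fill missing titles with a placeholder.
# fillna returns a new dataframe, so assign the result to keep the change.
filled = sample.fillna(value={"Title": "Unknown"})

# Option 2: drop rows that have no review text.
with_text = sample.dropna(subset=["Review Text"])
```

Summing the mask gives one number per column, which is usually easier to read than thousands of True/False rows. Whether to fill or drop depends on the analysis, so treat both options as starting points.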

In [18]:
# Try to find some missing data!
print(df.isna())


       Unnamed: 0  Clothing ID    Age  Title  Review Text  Rating  \
0           False        False  False   True        False   False   
1           False        False  False   True        False   False   
2           False        False  False  False        False   False   
3           False        False  False  False        False   False   
4           False        False  False  False        False   False   
...           ...          ...    ...    ...          ...     ...   
23481       False        False  False  False        False   False   
23482       False        False  False  False        False   False   
23483       False        False  False  False        False   False   
23484       False        False  False  False        False   False   
23485       False        False  False  False        False   False   

       Recommended IND  Positive Feedback Count  Division Name  \
0                False                    False          False   
1                False                 

In [19]:
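# Try to find some missing data!
print(df.isna())

# Fill missing titles with a placeholder. fillna returns a new dataframe, so df itself is unchanged.
cols = {"Title": "Unknown"}
print(df.fillna(value=cols))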

       Unnamed: 0  Clothing ID    Age  Title  Review Text  Rating  \
0           False        False  False   True        False   False   
1           False        False  False   True        False   False   
2           False        False  False  False        False   False   
3           False        False  False  False        False   False   
4           False        False  False  False        False   False   
...           ...          ...    ...    ...          ...     ...   
23481       False        False  False  False        False   False   
23482       False        False  False  False        False   False   
23483       False        False  False  False        False   False   
23484       False        False  False  False        False   False   
23485       False        False  False  False        False   False   

       Recommended IND  Positive Feedback Count  Division Name  \
0                False                    False          False   
1                False                 

Did you find any missing data? What things worked well for you and what did not?

In [None]:
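# Respond to the above questions here: Yes, I found missing data. Per df.info(), Title is missing 3,810 of 23,486 values, Review Text is missing 845, and Division Name, Department Name, and Class Name are each missing 14. isna() located the gaps and fillna() replaced the missing titles with "Unknown". I had no difficulties.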

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.
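
One common way to flag outliers, beyond reading summary statistics by eye, is the interquartile range (IQR) rule. The sketch below applies it to a hypothetical `ages` series; the values and the 1.5 multiplier are illustrative assumptions rather than anything prescribed by the exercise.

```python
import pandas as pd

# Hypothetical ages, standing in for the dataset's Age column.
ages = pd.Series([33, 34, 60, 50, 47, 41, 39, 99], name="Age")

# The interquartile range spans the middle 50% of the values.
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values more than 1.5 * IQR outside the quartiles are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)
```

A flagged value is not necessarily wrong. It is only a candidate to inspect before deciding whether to keep, cap, or remove it.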

In [20]:
# Keep an eye out for outliers!
print(df.describe())

         Unnamed: 0   Clothing ID           Age        Rating  \
count  23486.000000  23486.000000  23486.000000  23486.000000   
mean   11742.500000    918.118709     43.198544      4.196032   
std     6779.968547    203.298980     12.279544      1.110031   
min        0.000000      0.000000     18.000000      1.000000   
25%     5871.250000    861.000000     34.000000      4.000000   
50%    11742.500000    936.000000     41.000000      5.000000   
75%    17613.750000   1078.000000     52.000000      5.000000   
max    23485.000000   1205.000000     99.000000      5.000000   

       Recommended IND  Positive Feedback Count  
count     23486.000000             23486.000000  
mean          0.822362                 2.535936  
std           0.382216                 5.702202  
min           0.000000                 0.000000  
25%           1.000000                 0.000000  
50%           1.000000                 1.000000  
75%           1.000000                 3.000000  
max           

What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [None]:
# Make your notes here: I used the .describe() function to help find outliers.

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.
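
The checks below are a minimal sketch of how one might look for redundant columns and repeated rows. The `sample` dataframe is hypothetical. It mimics the `Unnamed: 0` column, which in the output above appears to repeat the row index.

```python
import pandas as pd

# Hypothetical miniature stand-in for the reviews dataframe.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Clothing ID": [767, 1080, 1080, 1049],
    "Rating": [4, 5, 5, 5],
})

# A column that merely repeats the row index carries no extra information.
index_copy = (sample["Unnamed: 0"] == sample.index).all()

# Drop the column only if it really is a copy of the index.
trimmed = sample.drop(columns=["Unnamed: 0"]) if index_copy else sample

# Count rows that are exact repeats of an earlier row, then remove them.
duplicate_rows = trimmed.duplicated().sum()
deduped = trimmed.drop_duplicates()
print(duplicate_rows)
```

In real review data, identical rows may still be distinct reviews, so it is worth inspecting duplicates before dropping them.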

In [22]:
# Look out for unnecessary data!
print(df.drop(columns=['Review Text']))

       Unnamed: 0  Clothing ID  Age  \
0               0          767   33   
1               1         1080   34   
2               2         1077   60   
3               3         1049   50   
4               4          847   47   
...           ...          ...  ...   
23481       23481         1104   34   
23482       23482          862   48   
23483       23483         1104   31   
23484       23484         1084   28   
23485       23485         1104   52   

                                                   Title  Rating  \
0                                                    NaN       4   
1                                                    NaN       5   
2                                Some major design flaws       3   
3                                       My favorite buy!       5   
4                                       Flattering shirt       5   
...                                                  ...     ...   
23481                     Great dress for many occasion

Did you find any unnecessary data in your dataset? How did you handle it?

In [None]:
# Make your notes here: I considered the 'Review Text' column unnecessary, so I used the drop() function to remove it.

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.
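
A quick way to spot inconsistent formatting in a text column is to list its distinct values, then normalize whitespace and capitalization. The sketch below uses a hypothetical `divisions` series whose values are made up for illustration.

```python
import pandas as pd

# Hypothetical category values with inconsistent spacing and capitalization.
divisions = pd.Series(
    ["General", "general ", " GENERAL", "Initmates", "General Petite"],
    name="Division Name",
)

# value_counts exposes variants that should be the same category.
print(divisions.value_counts())

# Strip surrounding whitespace and apply one capitalization style.
cleaned = divisions.str.strip().str.title()
print(cleaned.value_counts())
```

Normalizing this way collapses the three spellings of "General" into one category. Misspellings are not caught by this step and would need an explicit mapping.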

In [None]:
# Look out for inconsistent data!

Did you find any inconsistent data? What did you do to clean it?

In [None]:
# Make your notes here! I personally did not see anything inconsistent within this data. 