# Google Play Store Data Cleaning
We used a data set of apps from the Google Play Store. This notebook documents the steps we took to clean the data for our analysis.

In [43]:
# Import packages
import pandas as pd
import numpy as np

In [44]:
# Read data set
csvpath = 'Resources/googleplaystore.csv'
df = pd.read_csv(csvpath, encoding='UTF-8')

### The original shape of the data frame and its types.

In [45]:
# Original df
df.shape

(10839, 13)

In [46]:
df.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

### Removing unneeded columns
At the start of data cleaning, we reviewed which columns would be useful. We decided that the versioning and last-updated columns did not fit what we wanted to show in this project. We also removed the Genres column: it appeared to be entered by hand, contained a few errors, and was redundant (e.g., Genre: Education;education showed up several times). The column holding the official app store category suffices for our needs.

In [47]:
# Delete versioning, genres and last updated columns, deemed unnecessary
del df['Android Ver']
del df['Current Ver']
del df['Last Updated']
del df['Genres']
df.head()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone |


In [48]:
# Modified df shape
df.shape

(10839, 9)
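As an aside, the four `del` statements above could also be written as a single `DataFrame.drop` call. The following is a minimal sketch on a made-up two-row frame (the `toy` frame and its values are illustrative assumptions, not rows from the real data set):

```python
import pandas as pd

# Toy frame standing in for the Play Store data (values are illustrative only)
toy = pd.DataFrame({
    'App': ['A', 'B'],
    'Category': ['TOOLS', 'GAME'],
    'Genres': ['Tools', 'Action'],
    'Last Updated': ['January 1, 2018', 'June 5, 2018'],
    'Current Ver': ['1.0', '2.3'],
    'Android Ver': ['4.0 and up', '5.0 and up'],
})

# Columns we decided not to use
unused_columns = ['Android Ver', 'Current Ver', 'Last Updated', 'Genres']

# drop() returns a new frame and leaves the original untouched
trimmed = toy.drop(columns=unused_columns)
print(trimmed.columns.tolist())
```

Because `drop` returns a new frame, the cell can be re-run without raising a `KeyError`, which `del` would do on a second run.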

### Removing the NaNs from Ratings.
We knew early on that we wanted to work with the Rating column. Unfortunately, it contains NaN values (1,473 of the 10,839 rows), and the data set gives no explanation for why those apps have no rating. Rather than remove those rows from the original data frame, we created a separate data frame with the NaN ratings filtered out.

In [49]:
# Keep only rows with a non-null Rating (df itself is unchanged)
df_Ratings = df[pd.notnull(df['Rating'])]
df_Ratings.head()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone |


In [50]:
# DF Ratings shape
df_Ratings.shape

(9366, 9)
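To make the filtering step easier to check, it can help to count the missing ratings before removing them. The sketch below uses a made-up three-row frame (the `toy` frame is an assumption for illustration) and `dropna(subset=...)`, which is an equivalent way to express the `pd.notnull` filter used above:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing rating (values are illustrative only)
toy = pd.DataFrame({
    'App': ['A', 'B', 'C'],
    'Rating': [4.1, np.nan, 3.9],
})

# Count how many ratings are missing before filtering
missing = toy['Rating'].isna().sum()

# Drop only the rows where Rating is NaN; other columns are not inspected
rated = toy.dropna(subset=['Rating'])

print(missing, rated.shape)
```

The count of missing values plus the row count of the filtered frame should equal the row count of the original frame.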

### Removing strings from Size column.
The other main column we wanted to use is Size, which holds the size of each app. The problem is that some apps list "Varies with device" as their size, and we cannot do arithmetic on a vague string. As with the NaN ratings, we created a separate data frame with these rows removed so that the original data frame is unaffected.

In [51]:
# Remove 'Varies with device' from Size column
df_Size = df[df.Size != 'Varies with device']
df_Size.head()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone |


In [52]:
# DF Size shape
df_Size.shape

(9145, 9)
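The same filter can be written with an explicit boolean mask, which also makes it easy to see how many rows are affected. This is a small sketch on made-up values (the `toy` frame is an assumption, not the real data):

```python
import pandas as pd

# Toy frame mixing real sizes with the placeholder string (illustrative only)
toy = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Size': ['19M', 'Varies with device', '201k', 'Varies with device'],
})

# Boolean mask marking the rows whose size is the placeholder string
varies = toy['Size'].eq('Varies with device')

# Keep the remaining rows; .copy() gives an independent frame to modify later
sized = toy.loc[~varies].copy()

print(varies.sum(), sized['Size'].tolist())
```

Taking a `.copy()` here avoids pandas warnings if columns of the filtered frame are modified in later cells.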

## Standardizing the size format

### Prepping data to plot for average installs vs size
We had to modify the data frame because of issues with the Size and Installs columns. The sizes were not all in the same unit, so we converted everything to MB. The install counts span such a wide range that the averages were hard to plot, so we divided those values by 1,000 and noted this on the plot.

In [53]:
# Create data frame containing data for comparing installs and size
sub_df = df_Size[['App','Category','Size','Installs','Type']].copy()

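# Clean up installs column: strip '+' and ',' as literal characters, then convert to float
# (regex=False matters because '+' is a regex metacharacter)
sub_df['Installs'] = sub_df['Installs'].str.replace('+', '', regex=False)
sub_df['Installs'] = sub_df['Installs'].str.replace(',', '', regex=False)
sub_df['Installs'] = sub_df['Installs'].astype('float64')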

# Convert kb values to mb and remove suffix, make DF
kb_df = sub_df.loc[sub_df['Size'].str.endswith('k')].copy()
kb_df['Size'] = kb_df['Size'].str.replace('k','').astype('float64')/1000

# Remove suffix from mb values, make DF
mb_df = sub_df.loc[sub_df['Size'].str.endswith('M')].copy()
mb_df['Size'] = mb_df['Size'].str.replace('M','').astype('float64')

# Concat the two frames
main_df = pd.concat([kb_df,mb_df])

# Calculate average installs (in thousands) and average size (in MB) per category
grouped = main_df.groupby('Category')
df_InstallsSizeCat = pd.DataFrame({
    'Average Installs': grouped['Installs'].mean() / 1000,
    'Average Install Size': grouped['Size'].mean(),
})

# Reset Index
df_InstallsSizeCat = df_InstallsSizeCat.reset_index()
df_InstallsSizeApp = df_InstallsSizeCat.reset_index(drop=True)

df_InstallsSizeCat.head()

| | Category | Average Installs | Average Install Size |
|---|---|---|---|
| 0 | ART_AND_DESIGN | 1602.227419 | 12.370968 |
| 1 | AUTO_AND_VEHICLES | 583.602813 | 20.037147 |
| 2 | BEAUTY | 291.424468 | 13.795745 |
| 3 | BOOKS_AND_REFERENCE | 710.467391 | 13.310822 |
| 4 | BUSINESS | 1340.1964 | 14.472162 |
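The kB-to-MB conversion above splits the data into two frames and concatenates them, which puts all the `k` rows first. One possible alternative, sketched here on made-up values, converts every size in a single pass and keeps the original row order. The names `parts`, `divisor`, and `size_mb` are hypothetical, and the divisor of 1000 simply follows the convention used in this notebook:

```python
import pandas as pd

# Illustrative size strings in the same format as the Size column
sizes = pd.Series(['19M', '8.7M', '201k', '23k'])

# Split each string into a numeric part and a unit suffix ('k' or 'M')
parts = sizes.str.extract(r'^(?P<value>[\d.]+)(?P<unit>[kM])$')

# Divisor that brings each unit to MB (kB / 1000, as in the cell above)
divisor = parts['unit'].map({'k': 1000.0, 'M': 1.0})

# Size in MB, in the original row order
size_mb = parts['value'].astype('float64') / divisor

print(size_mb.tolist())
```

Any string that does not match the pattern (for example "Varies with device") would come out as NaN rather than raising an error, so it is still worth filtering those rows first.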


In [54]:
# Build a per-app data frame with size (MB) and installs, plus installs in thousands
df_AppInstalls = main_df[['App','Size','Installs']].copy()
df_AppInstalls['Installs (in Thousands)'] = df_AppInstalls['Installs']/1000

df_AppInstalls=df_AppInstalls.reset_index(drop=True)
df_AppInstalls.head()

| | App | Size | Installs | Installs (in Thousands) |
|---|---|---|---|---|
| 0 | Restart Navigator | 0.201 | 100000.0 | 100.0 |
| 1 | Plugin:AOT v5.0 | 0.023 | 100000.0 | 100.0 |
| 2 | Hangouts Dialer - Call Phones | 0.079 | 10000000.0 | 10000.0 |
| 3 | Caller ID + | 0.118 | 1000000.0 | 1000.0 |
| 4 | GO Notifier | 0.695 | 10000000.0 | 10000.0 |
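For reference, the path from the raw Installs strings to the "in thousands" values can be sketched end to end on a few sample strings taken from the tables above. The variable names `cleaned` and `in_thousands` are hypothetical:

```python
import pandas as pd

# Install counts in the raw format used by the data set
installs = pd.Series(['10,000+', '500,000+', '5,000,000+'])

# Strip the trailing '+' and the thousands separators as literal characters,
# then convert the remaining digits to float
cleaned = (
    installs
    .str.replace('+', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype('float64')
)

# Scale to thousands to keep plotted values in a readable range
in_thousands = cleaned / 1000

print(in_thousands.tolist())  # [10.0, 500.0, 5000.0]
```

Passing `regex=False` keeps the behavior the same across pandas versions, since `+` would otherwise be interpreted as a regex quantifier when regex matching is enabled.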
