# Appendix: Data cleaning description

## Collection

This dataset was downloaded from https://www.kaggle.com/lava18/google-play-store-apps.

## Cleaning

In [1]:
import pandas as pd
import numpy as np

In [2]:
playstore_raw = pd.read_csv('googleplaystore.csv')
playstore = playstore_raw.copy()
playstore.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


### Rename the columns

In [3]:
# Lowercase the column names and replace spaces with underscores
playstore.columns = playstore.columns.str.lower().str.replace(' ', '_', regex = False)
playstore.head()

Unnamed: 0,app,category,rating,reviews,size,installs,type,price,content_rating,genres,last_updated,current_ver,android_ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


### Convert data types if necessary

#### See the types of each column

In [4]:
playstore.dtypes

app                object
category           object
rating            float64
reviews            object
size               object
installs           object
type               object
price              object
content_rating     object
genres             object
last_updated       object
current_ver        object
android_ver        object
dtype: object

#### Convert reviews to type int

Calling to_numeric on the 'reviews' column raised an error: the row at index 10472 is missing its 'category' value, so every subsequent value in that row is shifted one column to the left.

In [5]:
playstore.iloc[10472]

app               Life Made WI-Fi Touchscreen Photo Frame
category                                              1.9
rating                                                 19
reviews                                              3.0M
size                                               1,000+
installs                                             Free
type                                                    0
price                                            Everyone
content_rating                                        NaN
genres                                  February 11, 2018
last_updated                                       1.0.19
current_ver                                    4.0 and up
android_ver                                           NaN
Name: 10472, dtype: object
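
One way to locate rows like this is to coerce the column instead of letting the conversion fail. The following is a minimal sketch on made-up values, not the notebook's data: `errors='coerce'` turns anything that cannot be parsed into NaN, and the rows that became NaN without being missing originally are the ones to inspect.

```python
import pandas as pd

# Made-up sample standing in for the 'reviews' column
reviews = pd.Series(['159', '967', '3.0M', '87510'])

# Unparseable strings become NaN instead of raising an error
as_numbers = pd.to_numeric(reviews, errors='coerce')

# Rows that failed to parse but were not missing to begin with
bad_rows = reviews[as_numbers.isna() & reviews.notna()]
print(bad_rows)
```

On the full dataset the same mask would be applied to playstore['reviews'].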

Correct the data in the row at index 10472.  
I looked up "Life Made WI-Fi Touchscreen Photo Frame" on the Play Store, and the category for the app is Lifestyle.

In [6]:
# Shift every value in row 10472 one column to the right
playstore.iloc[[10472]] = playstore.iloc[[10472]].shift(periods = 1, axis = 'columns')

In [7]:
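# The shift moved the app name into 'category'; move it back to 'app'
playstore.loc[10472, 'app'] = playstore.loc[10472, 'category']
# Fill in the category found on the Play Store
playstore.loc[10472, 'category'] = 'LIFESTYLE'
# Re-enter rating and reviews as numbers
playstore.loc[10472, 'rating'] = 1.9
playstore.loc[10472, 'reviews'] = 19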
playstore.iloc[10472]

app               Life Made WI-Fi Touchscreen Photo Frame
category                                        LIFESTYLE
rating                                                1.9
reviews                                                19
size                                                 3.0M
installs                                           1,000+
type                                                 Free
price                                                   0
content_rating                                   Everyone
genres                                                NaN
last_updated                            February 11, 2018
current_ver                                        1.0.19
android_ver                                    4.0 and up
Name: 10472, dtype: object
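
As an alternative to shifting, the repaired row can be rebuilt from a plain list. This is only a sketch on a made-up two-row frame (the app names and values are hypothetical), not the approach used above: it keeps the app name, inserts the looked-up category, and drops the empty trailing slot.

```python
import pandas as pd

# Made-up frame; row 1 is missing 'category', so its values sit one column too far left
df = pd.DataFrame(
    {
        'app': ['Some App', 'Shifted App'],
        'category': ['TOOLS', '1.9'],
        'rating': ['4.0', '19'],
        'reviews': ['10', None],
    },
    dtype=object,
)

values = df.loc[1].tolist()
# Keep the app name, insert the category, and drop the trailing empty value
df.loc[1] = [values[0], 'LIFESTYLE'] + values[1:-1]
print(df.loc[1].tolist())
```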

Convert the 'reviews' column to type int.

In [8]:
playstore['reviews'] = pd.to_numeric(playstore['reviews'])
playstore['reviews'].dtype

dtype('int64')

#### Convert price to type float

Remove the '$' character, then convert to float.

In [9]:
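# regex = False treats '$' as a literal character, not the regex end-of-string anchor
playstore['price'] = pd.to_numeric(playstore['price'].str.replace('$', '', regex = False))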
playstore['price'].dtype

dtype('float64')
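
A small sketch of the same conversion on made-up prices. The sample values are assumptions; passing `regex=False` makes it explicit that '$' is meant literally, since as a regular expression '$' matches the end of the string rather than a dollar sign.

```python
import pandas as pd

# Made-up sample standing in for the 'price' column
prices = pd.Series(['0', '$4.99', '$0.99'])

# Strip the literal dollar sign, then parse the remaining text as numbers
cleaned = pd.to_numeric(prices.str.replace('$', '', regex=False))
print(cleaned.tolist())
print(cleaned.dtype)
```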

#### Convert last_updated to type datetime

In [10]:
playstore['last_updated'] = pd.to_datetime(playstore['last_updated'], format = '%B %d, %Y')
playstore['last_updated'].dtype

dtype('<M8[ns]')
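
For reference, the format string reads as follows: %B is the full month name, %d the day of the month, and %Y the four-digit year. A minimal sketch on two date strings taken from the table at the top of this appendix:

```python
import pandas as pd

# Two date strings in the same style as the 'last_updated' column
raw_dates = pd.Series(['January 7, 2018', 'June 20, 2018'])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(raw_dates, format='%B %d, %Y')
print(parsed.dt.strftime('%Y-%m-%d').tolist())
```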

### Clean the installs column

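Combine the install buckets so that each one starts with 1: every '5...' bucket is merged into the '1...' bucket below it (e.g. '5,000+' becomes '1,000+'), '0+' is merged into '1+', and the thousands separators are removed.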

In [11]:
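# Merge each '5...' bucket into the '1...' bucket below it
temp_installs = playstore['installs'].str.replace('5', '1', regex = False)
# Merge '0+' into '1+' (the pattern is a regular expression, so regex = True is required)
temp_installs = temp_installs.str.replace(r'^0\+', '1+', regex = True)
# Remove the thousands separators
temp_installs = temp_installs.str.replace(',', '', regex = False)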

playstore['installs'] = temp_installs

playstore['installs'].value_counts().sort_index()

0                 1
1+              163
10+             591
100+           1049
1000+          1385
10000+         1533
100000+        1708
1000000+       2331
10000000+      1541
100000000+      481
1000000000+      58
Name: installs, dtype: int64
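
The effect of the three replacements can be checked on a handful of values. This is a sketch on an illustrative sample (the input strings are assumptions about what the raw column contains), with the regex flags spelled out so the behavior does not depend on the pandas version.

```python
import pandas as pd

# Illustrative raw install buckets
installs = pd.Series(['0', '0+', '1+', '5+', '5,000+', '10,000+', '500,000,000+'])

collapsed = (
    installs
    .str.replace('5', '1', regex=False)      # '5...' buckets join the '1...' bucket below
    .str.replace(r'^0\+', '1+', regex=True)  # '0+' joins '1+'
    .str.replace(',', '', regex=False)       # drop thousands separators
)
print(collapsed.tolist())
```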

### Other modifications

#### Delete ' and up' in the android_ver column

In [12]:
playstore['android_ver'] = playstore['android_ver'].str.replace(' and up', '', regex = False)

#### Make the values in the category column lowercase

In [13]:
playstore['category'] = playstore['category'].str.lower()

#### Drop the type column

The 'type' column only has two values, and the 'price' column can convey the same information: if the price is 0, the type is Free.

In [14]:
playstore['type'].value_counts()

Free    10040
Paid      800
Name: type, dtype: int64
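
Before dropping the column, the claim that 'price' carries the same information can be verified by deriving the type from the price and comparing. A minimal sketch on a hypothetical sample; on the real data the same comparison would run against the playstore DataFrame.

```python
import pandas as pd

# Hypothetical sample with the two columns of interest
sample = pd.DataFrame({
    'type': ['Free', 'Paid', 'Free'],
    'price': [0.0, 4.99, 0.0],
})

# Derive the type from the price alone
derived_type = sample['price'].map(lambda p: 'Free' if p == 0 else 'Paid')

# Any row listed here would contradict the assumption
mismatches = sample[derived_type != sample['type']]
print(len(mismatches))
```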

In [15]:
playstore = playstore.drop(columns = 'type')

#### Drop the size column

The size of the app is not relevant in the majority of cases.

In [16]:
playstore = playstore.drop(columns = 'size')

### Remove redundant entries in the app column

The 'app' column contains duplicate entries.

In [17]:
playstore['app'].value_counts()

ROBLOX                                                9
CBS Sports App - Scores, News, Stats & Watch Live     8
8 Ball Pool                                           7
Duolingo: Learn Languages Free                        7
ESPN                                                  7
Candy Crush Saga                                      7
Bowmasters                                            6
Temple Run 2                                          6
Sniper 3D Gun Shooter: Free Shooting Games - FPS      6
Helix Jump                                            6
Bubble Shooter                                        6
Bleacher Report: sports news, scores, & highlights    6
Subway Surfers                                        6
Nick                                                  6
Zombie Catchers                                       6
slither.io                                            6
Skyscanner                                            5
TripAdvisor Hotels Flights Restaurants Attractio

Printing some of the duplicate entries shows that they differ mainly in 'category' and 'reviews', and occasionally in 'last_updated'.

In [18]:
playstore[playstore['app'] == 'ROBLOX']

Unnamed: 0,app,category,rating,reviews,installs,price,content_rating,genres,last_updated,current_ver,android_ver
1653,ROBLOX,game,4.5,4447388,100000000+,0.0,Everyone 10+,Adventure;Action & Adventure,2018-07-31,2.347.225742,4.1
1701,ROBLOX,game,4.5,4447346,100000000+,0.0,Everyone 10+,Adventure;Action & Adventure,2018-07-31,2.347.225742,4.1
1748,ROBLOX,game,4.5,4448791,100000000+,0.0,Everyone 10+,Adventure;Action & Adventure,2018-07-31,2.347.225742,4.1
1841,ROBLOX,game,4.5,4449882,100000000+,0.0,Everyone 10+,Adventure;Action & Adventure,2018-07-31,2.347.225742,4.1
1870,ROBLOX,game,4.5,4449910,100000000+,0.0,Everyone 10+,Adventure;Action & Adventure,2018-07-31,2.347.225742,4.1
2016,ROBLOX,family,4.5,4449910,100000000+,0.0,Everyone 10+,Adventure;Action & Adventure,2018-07-31,2.347.225742,4.1
2088,ROBLOX,family,4.5,4450855,100000000+,0.0,Everyone 10+,Adventure;Action & Adventure,2018-07-31,2.347.225742,4.1
2206,ROBLOX,family,4.5,4450890,100000000+,0.0,Everyone 10+,Adventure;Action & Adventure,2018-07-31,2.347.225742,4.1
4527,ROBLOX,family,4.5,4443407,100000000+,0.0,Everyone 10+,Adventure;Action & Adventure,2018-07-31,2.347.225742,4.1


In [19]:
playstore[playstore['app'] == 'CBS Sports App - Scores, News, Stats & Watch Live']

Unnamed: 0,app,category,rating,reviews,installs,price,content_rating,genres,last_updated,current_ver,android_ver
2976,"CBS Sports App - Scores, News, Stats & Watch Live",sports,4.3,91031,1000000+,0.0,Everyone,Sports,2018-08-04,Varies with device,5.0
3007,"CBS Sports App - Scores, News, Stats & Watch Live",sports,4.3,91031,1000000+,0.0,Everyone,Sports,2018-08-04,Varies with device,5.0
3015,"CBS Sports App - Scores, News, Stats & Watch Live",sports,4.3,91031,1000000+,0.0,Everyone,Sports,2018-08-04,Varies with device,5.0
3020,"CBS Sports App - Scores, News, Stats & Watch Live",sports,4.3,91031,1000000+,0.0,Everyone,Sports,2018-08-04,Varies with device,5.0
3056,"CBS Sports App - Scores, News, Stats & Watch Live",sports,4.3,91033,1000000+,0.0,Everyone,Sports,2018-08-04,Varies with device,5.0
3064,"CBS Sports App - Scores, News, Stats & Watch Live",sports,4.3,91033,1000000+,0.0,Everyone,Sports,2018-08-04,Varies with device,5.0
3090,"CBS Sports App - Scores, News, Stats & Watch Live",sports,4.3,91033,1000000+,0.0,Everyone,Sports,2018-08-04,Varies with device,5.0
9594,"CBS Sports App - Scores, News, Stats & Watch Live",sports,4.3,91035,1000000+,0.0,Everyone,Sports,2018-08-04,Varies with device,5.0


In [20]:
playstore[playstore['app'] == 'Candy Crush Saga']

Unnamed: 0,app,category,rating,reviews,installs,price,content_rating,genres,last_updated,current_ver,android_ver
1655,Candy Crush Saga,game,4.4,22426677,100000000+,0.0,Everyone,Casual,2018-07-05,1.129.0.2,4.1
1705,Candy Crush Saga,game,4.4,22428456,100000000+,0.0,Everyone,Casual,2018-07-05,1.129.0.2,4.1
1751,Candy Crush Saga,game,4.4,22428456,100000000+,0.0,Everyone,Casual,2018-07-05,1.129.0.2,4.1
1842,Candy Crush Saga,game,4.4,22429716,100000000+,0.0,Everyone,Casual,2018-07-05,1.129.0.2,4.1
1869,Candy Crush Saga,game,4.4,22430188,100000000+,0.0,Everyone,Casual,2018-07-05,1.129.0.2,4.1
1966,Candy Crush Saga,game,4.4,22430188,100000000+,0.0,Everyone,Casual,2018-07-05,1.129.0.2,4.1
3994,Candy Crush Saga,family,4.4,22419455,100000000+,0.0,Everyone,Casual,2018-07-05,1.129.0.2,4.1


In [21]:
playstore[playstore['app'] == 'Duolingo: Learn Languages Free']

Unnamed: 0,app,category,rating,reviews,installs,price,content_rating,genres,last_updated,current_ver,android_ver
699,Duolingo: Learn Languages Free,education,4.7,6289924,100000000+,0.0,Everyone,Education;Education,2018-08-01,Varies with device,Varies with device
784,Duolingo: Learn Languages Free,education,4.7,6290507,100000000+,0.0,Everyone,Education;Education,2018-08-01,Varies with device,Varies with device
799,Duolingo: Learn Languages Free,education,4.7,6290507,100000000+,0.0,Everyone,Education;Education,2018-08-01,Varies with device,Varies with device
826,Duolingo: Learn Languages Free,education,4.7,6290507,100000000+,0.0,Everyone,Education;Education,2018-08-01,Varies with device,Varies with device
2056,Duolingo: Learn Languages Free,family,4.7,6294400,100000000+,0.0,Everyone,Education;Education,2018-08-01,Varies with device,Varies with device
2216,Duolingo: Learn Languages Free,family,4.7,6294397,100000000+,0.0,Everyone,Education;Education,2018-08-01,Varies with device,Varies with device
8439,Duolingo: Learn Languages Free,family,4.7,6297590,100000000+,0.0,Everyone,Education;Education,2018-08-06,Varies with device,Varies with device


There's no way to know which review count is correct, so I decided to keep the first row for each app, since the values don't differ much. I considered keeping one entry per app for each distinct category, but that would skew the dataset by duplicating the values in every other column.

In [22]:
playstore = playstore.drop_duplicates(subset = 'app')
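
drop_duplicates keeps the first occurrence of each app by default (`keep='first'`). The sketch below spells that out on a small sample: the ROBLOX rows reuse values from the listing above, and the second app is made up.

```python
import pandas as pd

# Two ROBLOX rows from the listing above plus one made-up app
apps = pd.DataFrame({
    'app': ['ROBLOX', 'ROBLOX', 'Other App'],
    'category': ['game', 'family', 'sports'],
    'reviews': [4447388, 4449910, 1000],
})

# keep='first' is the default; it is written out here for clarity
deduped = apps.drop_duplicates(subset='app', keep='first')
print(deduped)
```

Only the 'game' row survives for ROBLOX, so the category that appears first in the file is the one retained.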

### NaN values in the rating column

In [23]:
print(f"there are {playstore['rating'].isna().sum()} NaN values in the rating column")

there are 1463 NaN values in the rating column


About 15% of the values in the rating column are missing, so deleting every observation with a NaN rating would discard a large portion of the data. Instead, replace the missing values with the mean of all ratings, rounded to two decimal places, which keeps the column's overall average essentially unchanged.

In [24]:
playstore['rating'] = playstore['rating'].fillna(round(playstore['rating'].mean(), 2))
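
The imputation step can be illustrated on a few made-up ratings. This is a sketch under that assumption; note that Series.mean skips NaN values by default, so the fill value is the mean of the observed ratings only.

```python
import numpy as np
import pandas as pd

# Made-up ratings with one missing value
ratings = pd.Series([4.1, np.nan, 4.7, 3.9])

# mean() ignores NaN, so this is the average of the three observed ratings
fill_value = round(ratings.mean(), 2)
filled = ratings.fillna(fill_value)
print(fill_value)
print(filled.tolist())
```

Mean imputation leaves the average nearly unchanged but reduces the spread of the column, which is worth keeping in mind when interpreting variance-based statistics on the cleaned data.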

### What the data looks like

In [25]:
playstore.head()

Unnamed: 0,app,category,rating,reviews,installs,price,content_rating,genres,last_updated,current_ver,android_ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,art_and_design,4.1,159,10000+,0.0,Everyone,Art & Design,2018-01-07,1.0.0,4.0.3
1,Coloring book moana,art_and_design,3.9,967,100000+,0.0,Everyone,Art & Design;Pretend Play,2018-01-15,2.0.0,4.0.3
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",art_and_design,4.7,87510,1000000+,0.0,Everyone,Art & Design,2018-08-01,1.2.4,4.0.3
3,Sketch - Draw & Paint,art_and_design,4.5,215644,10000000+,0.0,Teen,Art & Design,2018-06-08,Varies with device,4.2
4,Pixel Draw - Number Art Coloring Book,art_and_design,4.3,967,100000+,0.0,Everyone,Art & Design;Creativity,2018-06-20,1.1,4.4


### Export playstore as a csv file

In [26]:
playstore.to_csv('playstore_data.csv', index = False)
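
When the exported file is read back, date columns arrive as plain strings unless they are parsed explicitly. A minimal round-trip sketch, using an in-memory buffer and a made-up one-row frame in place of playstore_data.csv:

```python
import io

import pandas as pd

# Made-up one-row frame with a datetime column
df = pd.DataFrame({
    'app': ['Some App'],
    'last_updated': pd.to_datetime(['2018-01-07']),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype on load
restored = pd.read_csv(buffer, parse_dates=['last_updated'])
print(restored.dtypes)
```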