This is a 72-hour challenge I did in late March 2022 for a data scientist position at an Indonesian conglomerate, specifically at its media subsidiary. The premise of the challenge is to recommend three to five apps from Google Playstore based on number of downloads, rating, and positive reviews. The constraint is that the apps must not be overly popular (e.g. Instagram, WhatsApp, Facebook).

<h1><center>Five Recommended Apps from Google Playstore</center></h1>

## INTRODUCTION

I believe that recommendations should be based on outstanding performance, and that includes apps. However, this can be difficult for newly launched apps with a small number of users, as they lack exposure to a wider audience and may underperform in App Store Optimization (ASO). I decided to take up the task of recommending new apps worth trying, using sentiment analysis in Python on thousands of app reviews from 2010 to 2018, and to write an article about it.

The purpose of the task is to mutually benefit users and app developers. I hope to show users, through my data-backed recommendation, that there are apps which are as functional as the more popular apps out there. With more users, these apps will have more organic data for further development and can generate more revenue.

My recommendation will be based on a few KPIs of an app, such as the number of installs, a reasonable number of positive reviews, and the average rating in the last thirty days.

## DATA CLEANING

Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.

First, I need to import the required packages and datasets. In this project, I will be using `NumPy` for ... and `pandas` for data processing and CSV file I/O. I will also be using two datasets, `playstore_apps.csv` and `app_reviews.csv`.

In [1]:
# import packages
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# initialize view settings
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', True)

# import dataset    
playstore_apps = pd.read_csv('/kaggle/input/provided-data/playstore_apps.csv')
app_reviews = pd.read_csv('/kaggle/input/provided-data/app_reviews.csv')

`playstore_apps` contains simple information about apps: app name, genres, average rating, number of reviews, number of installs, when the app was last updated, and the minimum Android version required for the app to run. I can use these features as metrics of a well-performing app.

In [2]:
playstore_apps.head(3)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up


In [3]:
playstore_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


When cleaning a dataset, I always start by fixing the column names and their corresponding data types. Here, I am renaming the columns so that they convey meaning without losing context.

In [4]:
# rename columns
playstore_apps.rename(columns={'App':'name',
                               'Category':'category',
                               'Rating':'avg_rating',
                               'Reviews':'num_reviews',
                               'Size':'size',
                               'Installs':'num_installs',
                               'Type':'type',
                               'Price':'price_usd',
                               'Content Rating':'content_rating',
                               'Genres':'genre',
                               'Last Updated':'last_updated',
                               'Current Ver':'app_ver',
                               'Android Ver':'min_android_ver'},
                      inplace=True)

This observation is out of sequence, as I only noticed it later during the data exploration phase. Observation #10472 is missing its value for column `category`. Because of that, every other value has been shifted one column to the left of its proper column. For example, column `category` is supposed to hold a *string* value like `TOOLS`, but instead it holds a *numeric* value of `1.9`, which would make more sense under column `avg_rating`. I decided to drop that observation as it is not worth fixing.

In [5]:
playstore_apps.iloc[10472]

name               Life Made WI-Fi Touchscreen Photo Frame
category                                               1.9
avg_rating                                            19.0
num_reviews                                           3.0M
size                                                1,000+
num_installs                                          Free
type                                                     0
price_usd                                         Everyone
content_rating                                         NaN
genre                                    February 11, 2018
last_updated                                        1.0.19
app_ver                                         4.0 and up
min_android_ver                                        NaN
Name: 10472, dtype: object

In [6]:
playstore_apps.drop(10472, axis=0, inplace=True)
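A shifted row like this one does not have to be found by eye. As a rough sketch on a toy frame (the app names and values below are made up, not taken from the dataset), one could flag any row whose `category` parses as a number, since a genuine category label never should:

```python
import pandas as pd

# Toy frame with hypothetical values; the second row mimics the shifted record.
toy = pd.DataFrame({
    'name': ['App A', 'Shifted App', 'App C'],
    'category': ['TOOLS', '1.9', 'GAME'],
})

# A category label should never parse as a number, so any row where it does
# is a candidate for the column-shift problem described above.
parsed = pd.to_numeric(toy['category'], errors='coerce')
suspect_rows = toy[parsed.notna()]
print(suspect_rows)
```

On the real data, the same check would be applied to `playstore_apps['category']` before the column is dropped.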

I also noticed that columns `category` and `genre` carry essentially the same information, so it is redundant to have both. I decided to remove column `category`.

Columns `type` and `price_usd` are also redundant. When column `type` has the value `Free`, column `price_usd` has the value `0`, and vice versa. Therefore, I can use column `price_usd` alone to convey whether an app is free or paid.

In [7]:
playstore_apps.drop(columns=['category','type'], inplace=True)
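Before dropping a column as redundant, the claim can be verified rather than assumed. The following is a minimal sketch on made-up values, assuming the free price is stored as the string `'0'` as in the sample rows shown earlier:

```python
import pandas as pd

# Toy data with hypothetical values in the same raw string format.
toy = pd.DataFrame({
    'type': ['Free', 'Paid', 'Free'],
    'price_usd': ['0', '$4.99', '0'],
})

is_free_by_type = toy['type'].eq('Free')
is_free_by_price = toy['price_usd'].eq('0')

# True only if the two columns agree on every row.
columns_agree = bool((is_free_by_type == is_free_by_price).all())
print(columns_agree)
```

If the result were `False` on the real data, the disagreeing rows would be worth inspecting before discarding `type`.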

Before typecasting, I need to make sure that each *string* format is proper.

1. Column `size` indicates how much space an app takes up when installed on a phone. `k` should be `Kb` for Kilobyte, and `M` should be `Mb` for Megabyte.
1. Column `num_installs` indicates the number of times an app has been installed. Each value ends with a plus sign and contains thousands separators, which forces the data type to be *string*. It is more logical to store the values as *numeric*, so I remove the plus sign and the commas now and typecast later.
1. Column `price_usd` needs to be numeric, so the dollar sign has to go before typecasting.
1. Column `genre` has spaces in between, as in `Art & Design`. The spaces have to be removed because I plan to store multiple genres in a list, so each genre must be one continuous string.
1. Column `min_android_ver` does not need ` and up` at the tail of each value.
1. Column `content_rating` is adjusted according to Google Playstore's actual rating standard for North and South America. `Mature 17+` is listed as `Mature`, and `Adults only 18+` as `Adults`. Since the app prices are listed in USD, I assumed the data must have been retrieved from the United States.

In [8]:
# use literal (non-regex) replacements so that '+' and '$' are not read as regex syntax
playstore_apps['size'] = playstore_apps['size'].str.replace('k', 'Kb', regex=False).str.replace('M', 'Mb', regex=False)
playstore_apps['num_installs'] = playstore_apps['num_installs'].str.replace('+', '', regex=False).str.replace(',', '', regex=False)
playstore_apps['price_usd'] = playstore_apps['price_usd'].str.replace('$', '', regex=False)
playstore_apps['genre'] = playstore_apps['genre'].str.replace(' ', '', regex=False)
playstore_apps['min_android_ver'] = playstore_apps['min_android_ver'].str.replace(' and up', '', regex=False)
playstore_apps['content_rating'] = playstore_apps['content_rating'].str.replace(' 17+', '', regex=False).str.replace(' only 18+', '', regex=False)
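Characters such as `+` and `$` have a special meaning in regular expressions, so how `str.replace` treats them can depend on the `regex` argument and the pandas version. As a small sanity check on made-up values (not rows from the dataset), literal replacement with `regex=False` behaves like this:

```python
import pandas as pd

# Hypothetical raw values in the same format as the dataset columns.
installs = pd.Series(['10,000+', '500,000+'])
prices = pd.Series(['0', '$4.99'])

# regex=False treats '+', ',' and '$' as plain characters, not regex syntax.
installs_clean = (installs.str.replace('+', '', regex=False)
                          .str.replace(',', '', regex=False))
prices_clean = prices.str.replace('$', '', regex=False)

print(installs_clean.tolist())
print(prices_clean.tolist())
```

The cleaned values are still strings at this point; the conversion to numbers happens in the typecasting step.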

Now that the string formats have been partially cleaned, I remove duplicates.

In [9]:
playstore_apps.drop_duplicates(inplace=True)

Column `genre` needs its own kind of "typecasting" because I wanted each genre to be a separate item. A value like `Art&Design;Auto&Vehicle` should count as two genres, not one or four. This is only possible by removing the whitespace inside each genre (which I did) and storing the resulting *strings* in a list.

Using a list comprehension, I can create the lists as shown below.

In [10]:
playstore_apps['genre'] = [genres.split(';') for genres in playstore_apps['genre']]

Now to take care of the remaining columns. `num_reviews`, `num_installs` and `price_usd` need to be *numeric*. `app_ver` and `min_android_ver` need to be *strings*.


In [11]:
playstore_apps['num_reviews'] = pd.to_numeric(playstore_apps['num_reviews'], errors='coerce')
playstore_apps['num_installs'] = pd.to_numeric(playstore_apps['num_installs'], errors='coerce')
playstore_apps['price_usd'] = pd.to_numeric(playstore_apps['price_usd'], errors='coerce')
playstore_apps['app_ver'] = playstore_apps['app_ver'].astype(str)
playstore_apps['min_android_ver'] = playstore_apps['min_android_ver'].astype(str)
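With `errors='coerce'`, any value that cannot be parsed as a number silently becomes `NaN` instead of raising an error, so it is worth checking how many values were lost. A minimal illustration on made-up strings:

```python
import numpy as np
import pandas as pd

# Hypothetical raw values: a clean number, an unparseable string, a missing value.
raw = pd.Series(['159', '3.0M', np.nan])

# Unparseable entries are coerced to NaN rather than raising a ValueError.
numeric = pd.to_numeric(raw, errors='coerce')
num_lost = int(numeric.isna().sum() - raw.isna().sum())

print(numeric.tolist())
print(num_lost)
```

Here `num_lost` counts only the values that were present before conversion and missing afterwards.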

Finally, I drop observations that contain NaN and sort the dataset alphabetically by `name`.

In [12]:
playstore_apps.dropna(inplace=True)
playstore_apps.sort_values(by=['name'], inplace=True)

To avoid repeating the data cleaning process, I export the cleaned data as a JSON file. I also export a CSV file in case I want to explore the data again.

In [13]:
playstore_apps.to_json('cleaned_playstore_apps.json')
playstore_apps.to_csv('cleaned_playstore_apps.csv', index=False)
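One practical difference between the two formats, which may or may not have been the motivation here, is how the list-valued `genre` column survives a round trip. The sketch below uses an in-memory buffer and a made-up row instead of the real files:

```python
from io import StringIO

import pandas as pd

# Toy frame with a list-valued column, similar in shape to the cleaned data.
toy = pd.DataFrame({'name': ['App A'],
                    'genre': [['Art&Design', 'PretendPlay']]})

# Round-trip through JSON and CSV without touching the disk.
from_json = pd.read_json(StringIO(toy.to_json()))
from_csv = pd.read_csv(StringIO(toy.to_csv(index=False)))

genre_from_json = from_json['genre'].iloc[0]
genre_from_csv = from_csv['genre'].iloc[0]

print(type(genre_from_json))  # a list again
print(type(genre_from_csv))   # the list's text representation, as a string
```

When reloading from CSV, the `genre` strings would need to be parsed back into lists before use.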

## DATA EXPLORATION

Data exploration is the initial stage of data analysis, in which a data analyst uses visual exploration, rather than traditional data management systems, to understand what is in a dataset and the characteristics of the data.

## FEATURE ENGINEERING

Feature engineering is the use of domain knowledge to extract features from raw data.

## SENTIMENT ANALYSIS

Sentiment analysis uses natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. In this case, I will perform a "simple" sentiment analysis.

## CONCLUSION

## DOCUMENTATION