# Playstore Apps analysis & Visualization

### About the project:
In this project, we will be working on a real-world dataset of the google play store, one of the most used applications for downloading android apps. This project aims on cleaning the dataset, analyze the given dataset, and mining informational quality insights. This project also involves visualizing the data to better and easily understand trends and different categories.

### Tasks performed in this notebook:

1. Raw Data Importing
2. Raw Data Cleaning (& analyzing)
3. Cleaned Data Exporting 


##### We have performed these tasks for both of the datsets i.e. playstore_apps.csv and playstore_reviewss.csv 

In [1]:
# importing library

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [2]:
# loading playstore_apps dataset

apps = pd.read_csv('playstore_apps.csv')

In [3]:
# loading playstore_reviews dataset

reviews = pd.read_csv('playstore_reviews.csv')

## Apps Dataset

In [4]:
# Looking at top 5 rows of apps dataset

apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159.0,19M,10000.0,Free,0.0,Everyone,Art & Design,07-01-2018,1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967.0,14M,500000.0,Free,0.0,Everyone,Art & Design;Pretend Play,15-01-2018,2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510.0,8.7M,5000000.0,Free,0.0,Everyone,Art & Design,01-08-2018,1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644.0,25M,50000000.0,Free,0.0,Teen,Art & Design,08-06-2018,Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967.0,2.8M,100000.0,Free,0.0,Everyone,Art & Design;Creativity,20-06-2018,1.1,4.4 and up


In [5]:
# Looking at information of apps dataset

apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10840 non-null  float64
 4   Size            10841 non-null  object 
 5   Installs        10840 non-null  float64
 6   Type            10840 non-null  object 
 7   Price           10840 non-null  float64
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(4), object(9)
memory usage: 1.1+ MB


In [6]:
# Looking at the dimension

apps.shape

(10841, 13)

In [7]:
# Looking for duplicate records

apps.duplicated().sum()

483

In [8]:
# dropping all duplicates, but keeping the first one

apps.drop_duplicates(keep="first", inplace=True)

In [9]:
# cross-checking duplicate records

apps.duplicated().sum()

0

In [10]:
# Looking at new dimension
apps.shape

(10358, 13)

In [11]:
# looking for number of null values in each columns

apps.isnull().sum()

App                  0
Category             0
Rating            1465
Reviews              1
Size                 0
Installs             1
Type                 1
Price                1
Content Rating       1
Genres               0
Last Updated         1
Current Ver          8
Android Ver          3
dtype: int64

In [12]:
# looking for number of unique values in each columns

apps.nunique()

App               9660
Category            34
Rating              40
Reviews           6001
Size               462
Installs            20
Type                 3
Price               92
Content Rating       6
Genres             120
Last Updated      1377
Current Ver       2784
Android Ver         33
dtype: int64

**App Column**

App column has only 9660 unique values out of 10358 records. So, we need to drop these duplicates. That means we have 698 duplicates.

In [13]:
# Transforming the App column to a proper format

apps['App'] = apps['App'].str.title()

In [14]:
# Lets check for the number of unique values present in the app column after giving a proper formatting to App names.

len(apps['App'].unique())

9639

In [15]:
# Lets check for the number of duplicates present in the app column after giving a proper formatting to App names.

apps['App'].duplicated().sum()

719

In [16]:
# Just overviewing the duplicates

apps[apps['App'].duplicated()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
285,Quick Pdf Scanner + Ocr Free,BUSINESS,4.2,80804.0,Varies with device,5.000000e+06,Free,0.0,Everyone,Business,26-02-2018,Varies with device,4.0.3 and up
293,Officesuite : Free Office + Pdf Editor,BUSINESS,4.3,1002859.0,35M,1.000000e+08,Free,0.0,Everyone,Business,02-08-2018,9.7.14188,4.1 and up
294,Slack,BUSINESS,4.4,51510.0,Varies with device,5.000000e+06,Free,0.0,Everyone,Business,02-08-2018,Varies with device,Varies with device
382,Messenger – Text And Video Chat For Free,COMMUNICATION,4.0,56646578.0,Varies with device,1.000000e+09,Free,0.0,Everyone,Communication,01-08-2018,Varies with device,Varies with device
383,Imo Free Video Calls And Chat,COMMUNICATION,4.3,4785988.0,11M,5.000000e+08,Free,0.0,Everyone,Communication,08-06-2018,9.8.000000010501,4.0 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10647,Motorola Fm Radio,VIDEO_PLAYERS,3.9,54815.0,Varies with device,1.000000e+08,Free,0.0,Everyone,Video Players & Editors,02-05-2018,Varies with device,Varies with device
10715,Farmersonly Dating,DATING,3.0,1145.0,1.4M,1.000000e+05,Free,0.0,Mature 17+,Dating,25-02-2016,2.2,4.0 and up
10720,Firefox Focus: The Privacy Browser,COMMUNICATION,4.4,36981.0,4.0M,1.000000e+06,Free,0.0,Everyone,Communication,06-07-2018,5.2,5.0 and up
10730,Fp Notebook,MEDICAL,4.5,410.0,60M,5.000000e+04,Free,0.0,Everyone,Medical,24-03-2018,2.1.0.372,4.4 and up


In [17]:
# Looking for number of values for each unique value in the app column

apps['App'].value_counts()

Roblox                           9
8 Ball Pool                      7
Helix Jump                       6
Bubble Shooter                   6
Zombie Catchers                  6
                                ..
Cf-Bench                         1
Easynote Notepad | To Do List    1
Travelpirates                    1
Creative Destruction             1
Iobd2-Cf                         1
Name: App, Length: 9639, dtype: int64

In [18]:
# Selecting the App repeating most frequently for analysis purpose

apps[apps['App'] == "Roblox"]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1653,Roblox,GAME,4.5,4447388.0,67M,100000000.0,Free,0.0,Everyone 10+,Adventure;Action & Adventure,31-07-2018,2.347.225742,4.1 and up
1701,Roblox,GAME,4.5,4447346.0,67M,100000000.0,Free,0.0,Everyone 10+,Adventure;Action & Adventure,31-07-2018,2.347.225742,4.1 and up
1748,Roblox,GAME,4.5,4448791.0,67M,100000000.0,Free,0.0,Everyone 10+,Adventure;Action & Adventure,31-07-2018,2.347.225742,4.1 and up
1841,Roblox,GAME,4.5,4449882.0,67M,100000000.0,Free,0.0,Everyone 10+,Adventure;Action & Adventure,31-07-2018,2.347.225742,4.1 and up
1870,Roblox,GAME,4.5,4449910.0,67M,100000000.0,Free,0.0,Everyone 10+,Adventure;Action & Adventure,31-07-2018,2.347.225742,4.1 and up
2016,Roblox,FAMILY,4.5,4449910.0,67M,100000000.0,Free,0.0,Everyone 10+,Adventure;Action & Adventure,31-07-2018,2.347.225742,4.1 and up
2088,Roblox,FAMILY,4.5,4450855.0,67M,100000000.0,Free,0.0,Everyone 10+,Adventure;Action & Adventure,31-07-2018,2.347.225742,4.1 and up
2206,Roblox,FAMILY,4.5,4450890.0,67M,100000000.0,Free,0.0,Everyone 10+,Adventure;Action & Adventure,31-07-2018,2.347.225742,4.1 and up
4527,Roblox,FAMILY,4.5,4443407.0,67M,100000000.0,Free,0.0,Everyone 10+,Adventure;Action & Adventure,31-07-2018,2.347.225742,4.1 and up


###### Observation:

Here we can see that records having 'Roblox' as App name differ only in terms of Category and Number of Reviews.
So, we are going to select only that record which has the highest number of reviews. The problem occurs because of the scrapping technique. Scrapping has been done at different point of time because of which the values of different columns for the same app name are different.

###### Possible solution:

In [19]:
# changing datatype of Last Updated from object--->datetime format so that...
# ...we can sort duplicates with respect to Last Updated

apps['Last Updated'] = apps['Last Updated'].astype('datetime64[ns]')

In [20]:
# Sorting records with respect to the parameters provided in the list below...
# ...so that we can keep the first record on the basis of applied filters

apps.sort_values(by = ['App', 'Reviews', 'Installs', 'Last Updated' ], ascending = False, inplace = True)

In [21]:
# Dropping the duplicate App names without dropping the first record

apps.drop_duplicates('App', keep = 'first', inplace = True)

In [22]:
# Cross checking whether the duplicate app names have been removed or not.

apps['App'].duplicated().sum()

0

**Category Column**

In [23]:
# inspecting category column

apps['Category'].unique()

array(['SHOPPING', 'MAPS_AND_NAVIGATION', 'NEWS_AND_MAGAZINES',
       'FOOD_AND_DRINK', 'GAME', 'FAMILY', 'COMMUNICATION',
       'PERSONALIZATION', 'TOOLS', 'SPORTS', 'HOUSE_AND_HOME',
       'AUTO_AND_VEHICLES', 'DATING', 'LIFESTYLE', 'BUSINESS',
       'PARENTING', 'HEALTH_AND_FITNESS', 'MEDICAL', 'FINANCE',
       'ENTERTAINMENT', 'SOCIAL', 'TRAVEL_AND_LOCAL', 'PHOTOGRAPHY',
       'VIDEO_PLAYERS', 'LIBRARIES_AND_DEMO', 'BOOKS_AND_REFERENCE',
       'WEATHER', 'PRODUCTIVITY', 'EVENTS', 'ART_AND_DESIGN', 'BEAUTY',
       'COMICS', 'EDUCATION', '1.9'], dtype=object)

###### Observation:

Looking at the above output, we can see that '1.9' is not of categorical datatype. So let's inspect into it. 

In [24]:
# Looking at the record with category as '1.9'

apps[apps['Category']=='1.9']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made Wi-Fi Touchscreen Photo Frame,1.9,19.0,,"1,000+",,0,,,"February 11, 2018",NaT,4.0 and up,


###### Observation:

We can see that the data starting from the category column till the Android Version has data from the immediate column to its right. But even if we try to format the data by shifting the values, most of the columns would have NaN values. Hence dropping the entire row would be a better option.

In [25]:
# dropping the record

apps.drop(10472,inplace=True)

In [26]:
# Cross checking for unique values in Category column

apps['Category'].unique()

array(['SHOPPING', 'MAPS_AND_NAVIGATION', 'NEWS_AND_MAGAZINES',
       'FOOD_AND_DRINK', 'GAME', 'FAMILY', 'COMMUNICATION',
       'PERSONALIZATION', 'TOOLS', 'SPORTS', 'HOUSE_AND_HOME',
       'AUTO_AND_VEHICLES', 'DATING', 'LIFESTYLE', 'BUSINESS',
       'PARENTING', 'HEALTH_AND_FITNESS', 'MEDICAL', 'FINANCE',
       'ENTERTAINMENT', 'SOCIAL', 'TRAVEL_AND_LOCAL', 'PHOTOGRAPHY',
       'VIDEO_PLAYERS', 'LIBRARIES_AND_DEMO', 'BOOKS_AND_REFERENCE',
       'WEATHER', 'PRODUCTIVITY', 'EVENTS', 'ART_AND_DESIGN', 'BEAUTY',
       'COMICS', 'EDUCATION'], dtype=object)

In [27]:
# Checking the number of null values present in Category Column

apps['Category'].isnull().sum()

0

**Rating column**

In [28]:
# Checking for the number of null values in the Rating column

apps['Rating'].isnull().sum()

1458

In [29]:
# Capturing the columns with null values in a list

features_with_nan=[i for i in apps.columns if apps[i].isnull().sum()>=1]
features_with_nan

['Rating', 'Type', 'Current Ver', 'Android Ver']

In [30]:
# Checking the percentage of null values present in each columns with null values

for feature in features_with_nan:
    print(feature," has", np.round(apps[feature].isnull().mean()*100, 2), '% missing values')

Rating  has 15.13 % missing values
Type  has 0.01 % missing values
Current Ver  has 0.08 % missing values
Android Ver  has 0.02 % missing values


Here we can see that Rating has 15.13 % of its total data as null. 15.13% is a significant number and such a huge number of missing values can influence our analysis greatly as we keep progressing. Therefore, either we should impute the values or we should drop them.

But we can not directly conclude neither to drop the null values nor to impute ithem. 

In real world scenarios, even big companies struggle to gather enough amount of data for analysis purposes. In our case, we may lose some crucial informations from our dataset. So what should we do? Should we drop or not? 

In real world screnarios, the distributions of the datset are generally skewed. But if we find too much skewness in our dataset, then our statistical models won’t work effectively.

In skewed data, the tail region (that can be seen in the distribution diagram) may act as an outlier for our statistical model, and we know that outliers adversely affect a model’s performance, especially regression-based models. While there are statistical models that are robust enough to handle outliers like tree-based models, we’ll be limited in what other models we can try. So what should we do? In this case, we’ll need to transform the skewed data so that it becomes a Gaussian (or normal) distribution. Removing outliers and normalizing our data will allow us to experiment with more statistical models.

Overall, its a trial and error process. It's important to realize that data normalization isn't always necessary. In fact, sometimes it makes sense to do the opposite and add redundancy to a database. We should try out both conditions and check if our statistical model is performing better in first condition or in second condition. And finally, we should opt the process with a better performance over the another.

In [31]:
# Imputing Nan values with 0 

# (however dataset distribution is better if we repalce by mean) - cross_verified through sns.distplot

apps['Rating'] = apps['Rating'].fillna(0)

In [32]:
# Checking the number of unique values present in Rating column

apps['Rating'].unique()

array([3.9, 4.2, 4.9, 4.6, 3.8, 0. , 2.9, 4.4, 4.5, 4. , 3.2, 3.6, 4.1,
       4.3, 3.7, 4.7, 3.5, 5. , 4.8, 3.1, 2.8, 1.9, 3.4, 3.3, 2.7, 1.8,
       2.4, 2.3, 2.5, 3. , 2. , 2.6, 1. , 2.2, 2.1, 1.4, 1.6, 1.5, 1.2,
       1.7])

**Reviews column**

In [33]:
# Checking the datatype of Reviews Column

apps['Reviews'].dtypes

dtype('float64')

In [34]:
# Converting the datatype into int32

apps['Reviews'] = apps['Reviews'].astype("int32")

In [35]:
# Cross checking if the datatype is changed or not:

apps['Reviews'].dtypes

dtype('int32')

In [36]:
# Checking the unique values

apps['Reviews'].unique()

array([  117, 12572,    58, ...,  2490,  2019, 40467])

In [37]:
# Having a glimpse on Reviews column

apps['Reviews'].describe()

count    9.638000e+03
mean     2.167621e+05
std      1.832769e+06
min      0.000000e+00
25%      2.500000e+01
50%      9.785000e+02
75%      2.953125e+04
max      7.815831e+07
Name: Reviews, dtype: float64

**Size column**

In [38]:
# Checking the unique values:

apps['Size'].unique()

array(['9.8M', '24M', '3.8M', '1.7M', '36M', '549k', '13M', '11M', '6.7M',
       '19M', '118k', '72k', '2.5M', '5.6M', '21M', '52M', '25M',
       'Varies with device', '5.9M', '37M', '4.6M', '8.2M', '20M', '62M',
       '23M', '94M', '66M', '50M', '41M', '45M', '75M', '96M', '12M',
       '38M', '32M', '3.5M', '15M', '34M', '78M', '544k', '84M', '7.2M',
       '7.1M', '18M', '14M', '6.1M', '33M', '61M', '9.0M', '3.0M', '72M',
       '3.6M', '5.7M', '67M', '98M', '59M', '47M', '7.7M', '27M', '17M',
       '7.5M', '5.3M', '69M', '85M', '2.4M', '83M', '86M', '44M', '5.1M',
       '26M', '6.3M', '4.7M', '22M', '16M', '1.2M', '3.2M', '6.0M', '99M',
       '3.7M', '404k', '8.7M', '95M', '46M', '6.4M', '64M', '4.4M',
       '8.6M', '2.0M', '1.9M', '8.5M', '77M', '28M', '4.9M', '51M', '55M',
       '3.1M', '49M', '81M', '76M', '42M', '2.6M', '87M', '1020k', '1.5M',
       '30M', '53M', '29M', '5.0M', '3.3M', '4.1M', '1.0M', '56M', '887k',
       '6.8M', '1.3M', '10.0M', '41k', '1.8M', '903k'

###### Observation:

1. We can see a lot of values ending with unwanted characters. So we need to recognise and remove them. 

2. We can see that the size of the app is given in MBs and KBs in terms of M and K respectively.Hence we need to standardize the data by converting MB into KB.

3. There are many values as "Varies with device". So we need to replace them by zero in order to convert the whole Size column into numerical datatype (since the column doesn't already have 0 as a value).

In [39]:
# Capturing the last character from each values into a list

l=[]
for i in apps["Size"][:]:
    if i[-1] not in l:
        l.append(i[-1])
l

['M', 'k', 'e']

In [40]:
# Storing the indexes of the values ending with 'M' in store1.

store1 = apps[apps["Size"].str[-1]=="M"].index
store1

Int64Index([ 8133,  3859,  9828,  9819,  5844,  4653,  4453,  6648,  4288,
             4456,
            ...
             8219,  7738,  8483,  1393, 10252,  5940,  4636,  8532,   324,
             8884],
           dtype='int64', length=8098)

In [41]:
# Removing the last character from the size values captured in stored1

for i in store1:
    apps["Size"].loc[i] = apps["Size"].loc[i][:-1]

In [42]:
# Cross checking if we have successfully removed 'M':

apps['Size'].unique()

array(['9.8', '24', '3.8', '1.7', '36', '549k', '13', '11', '6.7', '19',
       '118k', '72k', '2.5', '5.6', '21', '52', '25',
       'Varies with device', '5.9', '37', '4.6', '8.2', '20', '62', '23',
       '94', '66', '50', '41', '45', '75', '96', '12', '38', '32', '3.5',
       '15', '34', '78', '544k', '84', '7.2', '7.1', '18', '14', '6.1',
       '33', '61', '9.0', '3.0', '72', '3.6', '5.7', '67', '98', '59',
       '47', '7.7', '27', '17', '7.5', '5.3', '69', '85', '2.4', '83',
       '86', '44', '5.1', '26', '6.3', '4.7', '22', '16', '1.2', '3.2',
       '6.0', '99', '3.7', '404k', '8.7', '95', '46', '6.4', '64', '4.4',
       '8.6', '2.0', '1.9', '8.5', '77', '28', '4.9', '51', '55', '3.1',
       '49', '81', '76', '42', '2.6', '87', '1020k', '1.5', '30', '53',
       '29', '5.0', '3.3', '4.1', '1.0', '56', '887k', '6.8', '1.3',
       '10.0', '41k', '1.8', '903k', '61k', '29k', '26k', '1.4', '5.8',
       '253k', '4.2', '9.9', '73', '6.6', '57', '7.8', '9.4', '10', '6.5',
    

In [43]:
# Storing the indexes of the values ending with 'k' in store1.

store2 = apps[apps["Size"].str[-1] == "k"].index
store2

Int64Index([ 5832,  4761, 10455,  3346, 10471, 10798,  6602,  1542,  8946,
             5451,
            ...
            10041, 10051,  4977,  4973,  4953,  4970,  4871,  6671,  4897,
             4541],
           dtype='int64', length=314)

In [44]:
# Removing the last character from the size values captured in stored2...
# ...and simultaneously converting values from KB to MB


for i in store2:
    apps["Size"].loc[i] = round(float(apps["Size"].loc[i][:-1])/1024,2)

In [45]:
# Cross checking if we have successfully removed 'k':


apps['Size'].unique()

array(['9.8', '24', '3.8', '1.7', '36', 0.54, '13', '11', '6.7', '19',
       0.12, 0.07, '2.5', '5.6', '21', '52', '25', 'Varies with device',
       '5.9', '37', '4.6', '8.2', '20', '62', '23', '94', '66', '50',
       '41', '45', '75', '96', '12', '38', '32', '3.5', '15', '34', '78',
       0.53, '84', '7.2', '7.1', '18', '14', '6.1', '33', '61', '9.0',
       '3.0', '72', '3.6', '5.7', '67', '98', '59', '47', '7.7', '27',
       '17', '7.5', '5.3', '69', '85', '2.4', '83', '86', '44', '5.1',
       '26', '6.3', '4.7', '22', '16', '1.2', '3.2', '6.0', '99', '3.7',
       0.39, '8.7', '95', '46', '6.4', '64', '4.4', '8.6', '2.0', '1.9',
       '8.5', '77', '28', '4.9', '51', '55', '3.1', '49', '81', '76',
       '42', '2.6', '87', 1.0, '1.5', '30', '53', '29', '5.0', '3.3',
       '4.1', '1.0', '56', 0.87, '6.8', '1.3', '10.0', 0.04, '1.8', 0.88,
       0.06, 0.03, '1.4', '5.8', 0.25, '4.2', '9.9', '73', '6.6', '57',
       '7.8', '9.4', '10', '6.5', 0.64, '9.2', '9.1', '58', '9.7', 

In [46]:
# Capturing string length for 'Varies with device' in a variable:

str_length = len("Varies with device")
str_length

18

In [47]:
# Storing the indexes of the values having "Varies with device" in the variable store3

store3 = apps[apps["Size"].str[-str_length:] == "Varies with device"].index
store3

Int64Index([2758, 6245, 7077, 1641,   89, 1370, 1914, 1255, 3256, 3253,
            ...
            1437, 4863, 7006, 4568, 1267, 4875, 3151, 3448, 7330, 7338],
           dtype='int64', length=1226)

In [48]:
# Replacing "Varies with device" with 0

for i in store3:
    apps["Size"].loc[i] = 0

In [49]:
# Cross checking if we have successfully imputed the values or not:

apps['Size'].unique()

array(['9.8', '24', '3.8', '1.7', '36', 0.54, '13', '11', '6.7', '19',
       0.12, 0.07, '2.5', '5.6', '21', '52', '25', 0, '5.9', '37', '4.6',
       '8.2', '20', '62', '23', '94', '66', '50', '41', '45', '75', '96',
       '12', '38', '32', '3.5', '15', '34', '78', 0.53, '84', '7.2',
       '7.1', '18', '14', '6.1', '33', '61', '9.0', '3.0', '72', '3.6',
       '5.7', '67', '98', '59', '47', '7.7', '27', '17', '7.5', '5.3',
       '69', '85', '2.4', '83', '86', '44', '5.1', '26', '6.3', '4.7',
       '22', '16', '1.2', '3.2', '6.0', '99', '3.7', 0.39, '8.7', '95',
       '46', '6.4', '64', '4.4', '8.6', '2.0', '1.9', '8.5', '77', '28',
       '4.9', '51', '55', '3.1', '49', '81', '76', '42', '2.6', '87', 1.0,
       '1.5', '30', '53', '29', '5.0', '3.3', '4.1', '1.0', '56', 0.87,
       '6.8', '1.3', '10.0', 0.04, '1.8', 0.88, 0.06, 0.03, '1.4', '5.8',
       0.25, '4.2', '9.9', '73', '6.6', '57', '7.8', '9.4', '10', '6.5',
       0.64, '9.2', '9.1', '58', '9.7', '4.3', '2.3', '4.5'

In [50]:
# Checking the datatype of the Size column:

apps['Size'].dtypes

dtype('O')

In [51]:
# Converting the datatype of the Size column:

apps['Size'] = apps['Size'].astype("float32")

In [52]:
# Cross checking the datatype of the Size column:

apps['Size'].dtypes

dtype('float32')

**Installs column**

In [53]:
#looking for null values:

apps['Installs'].isnull().sum()

0

In [54]:
# Looking at unique values:

apps['Installs'].unique()

array([1.e+04, 1.e+06, 5.e+02, 1.e+00, 1.e+03, 5.e+03, 1.e+02, 5.e+05,
       5.e+07, 5.e+04, 1.e+07, 1.e+05, 1.e+08, 5.e+06, 1.e+01, 1.e+09,
       5.e+01, 5.e+00, 5.e+08, 0.e+00])

In [55]:
# Checking the datatype:

apps['Installs'].dtypes

dtype('float64')

In [56]:
# Converting float64 into int32:

apps['Installs'] = apps['Installs'].astype("int32")

In [57]:
# Cross checking if we have successfully changed the datatype:

apps['Installs'].dtypes

dtype('int32')

**Type column**

In [58]:
# looking for null values:

apps['Type'].isnull().sum()

1

In [59]:
# Looking for that particular record with null value in Type column

apps[apps['Type'].isnull() == True]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
9148,Command & Conquer: Rivals,FAMILY,0.0,0,0.0,0,,0.0,Everyone 10+,Strategy,2018-06-28,Varies with device,Varies with device


###### Observation:
- This particular record is not much useful as apart from NaN in Type column the other columns like Reviews, Size, Installs doesn't have appropriate values which could be useful for future Analysis.

In [60]:
# Dropping the record

apps = apps[~apps['Type'].isnull()]

In [61]:
# Cross - checking for null value

apps['Type'].isnull().sum()

0

In [62]:
# checking unique values

apps['Type'].unique()

array(['Free', 'Paid'], dtype=object)

In [63]:
# Looking at counts of sub-type of Type column

apps['Type'].value_counts()

Free    8886
Paid     751
Name: Type, dtype: int64

**Price Column**

In [64]:
# looking for null values

apps['Price'].isnull().sum()

0

In [65]:
# Looking at unique values

apps['Price'].unique()

array([  0.  ,   2.99,   1.99,   1.49,   0.99,  12.99,   2.5 ,  19.99,
         2.56,   9.99,   4.49,   1.04,   3.99,   2.9 ,   1.  ,   5.99,
         2.49,  79.99,   8.99,   4.99,  16.99,   1.97,   3.49,  17.99,
         6.99,   6.49,   1.2 ,   4.85,   2.95,   4.59,  10.99,   7.49,
         4.84,   1.76,  29.99,   7.99,   5.49,  10.  ,   4.6 ,   3.02,
        14.99,  39.99,   1.7 ,  15.99,   4.29,  24.99, 399.99,   9.  ,
         8.49,   1.59,   1.61,  89.99,  74.99,  15.46,   1.26, 400.  ,
       299.99,  18.99,  37.99, 379.99,  25.99,   3.88,  13.99,   2.  ,
       394.99,   3.61,  11.99, 200.  ,   4.77,  28.99,  46.99,   3.28,
       154.99,   1.96,   3.95,   4.8 , 109.99,  19.4 ,  14.  ,   1.75,
         2.59,  19.9 ,   3.9 ,   1.5 ,   5.  ,   1.29,   3.04,   2.6 ,
        33.99,   3.08])

In [66]:
# Looking at the description of the Price column

apps['Price'].describe()

count    9637.000000
mean        1.014546
std        15.884451
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max       400.000000
Name: Price, dtype: float64

In [67]:
# Checking the datatype

apps['Price'].dtypes

dtype('float64')

In [68]:
# Converting the datatype into 'float32'

apps['Price'] = apps['Price'].astype("float32")

In [69]:
# Cross -checking the datatype

apps['Price'].dtypes

dtype('float32')

**Content Rating Column**

In [70]:
# looking for null values

apps['Content Rating'].isnull().sum()

0

In [71]:
# Looking at unique values

apps['Content Rating'].unique()

array(['Everyone', 'Everyone 10+', 'Teen', 'Mature 17+',
       'Adults only 18+', 'Unrated'], dtype=object)

In [72]:
# Looking at counts of sub-type of Content Rating column

apps['Content Rating'].value_counts()

Everyone           7883
Teen               1035
Mature 17+          393
Everyone 10+        321
Adults only 18+       3
Unrated               2
Name: Content Rating, dtype: int64

**Genres column**

In [73]:
# Looking for null values

apps['Genres'].isnull().sum()

0

In [74]:
# Looking at unique values

apps['Genres'].unique()

array(['Shopping', 'Maps & Navigation', 'News & Magazines',
       'Food & Drink', 'Arcade', 'Education', 'Communication',
       'Personalization', 'Tools', 'Sports', 'Entertainment', 'Casino',
       'House & Home', 'Auto & Vehicles', 'Education;Education', 'Dating',
       'Lifestyle', 'Business', 'Puzzle', 'Parenting;Education',
       'Simulation', 'Health & Fitness', 'Action', 'Strategy', 'Medical',
       'Finance', 'Social', 'Educational;Education', 'Travel & Local',
       'Photography', 'Video Players & Editors',
       'Entertainment;Music & Video', 'Trivia', 'Libraries & Demo',
       'Books & Reference', 'Casual', 'Weather', 'Productivity',
       'Racing;Action & Adventure', 'Events', 'Adventure', 'Art & Design',
       'Books & Reference;Creativity', 'Beauty',
       'Puzzle;Action & Adventure', 'Card', 'Board;Pretend Play', 'Word',
       'Puzzle;Brain Games', 'Role Playing', 'Board;Action & Adventure',
       'Parenting', 'Racing', 'Arcade;Action & Adventure', 'Comics'

###### Observation:
- We can observe that some of the values have two genres separated with `';'` (semicolon). So for further analysis, we will separate the them into two columns named as `Genre_1` and `Genre_2` and we will remove original Genres column.

In [75]:
# Storing the values of genre before ';' in Genre 1

Genre_1 = apps['Genres'].apply(lambda value : value.split(';')[0])


# Storing the values of genre after ';' in Genre 2
# Places where Genre_2 values are not present, substituting them with 'Not Applicable'

Genre_2 = apps['Genres'].apply(lambda value : value.split(';')[1] if len(value.split(';')) > 1 else "Not Applicable")

In [76]:
# Inserting the Genre_1 and Genre_2 columns at column index 10 & 11 respectively

apps.insert(10, 'Genre_1', Genre_1)
apps.insert(11, 'Genre_2', Genre_2)

In [77]:
# Looking at the dimension of apps dataset after addition of new columns

apps.shape

(9637, 15)

In [78]:
# Capturing the values of those indexes where Genre_1 and Genre_2 are same

apps_index = apps[apps['Genre_1'] == apps['Genre_2']].index

In [79]:
# Replacing the values in Genre_2 column with 'Not Applicable' where we found that Genre_2 is the same value as in Genre_1

for i in apps_index:
    apps['Genre_2'][i] = "Not Applicable"

In [80]:
# Dropping the original Genre column and save it separtely

Original_Genres = apps.pop('Genres')

In [81]:
# Looking at the dimension of apps dataset

apps.shape

(9637, 14)

In [82]:
# Looking at apps dataset

apps

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genre_1,Genre_2,Last Updated,Current Ver,Android Ver
8133,Šmelina .Cz Inzeráty Inzerce,SHOPPING,3.9,117,9.8,10000,Free,0.00,Everyone,Shopping,Not Applicable,2018-05-13,1.3,4.0 and up
3859,Öbb Scotty,MAPS_AND_NAVIGATION,4.2,12572,24.0,1000000,Free,0.00,Everyone,Maps & Navigation,Not Applicable,2018-02-19,5.4 (30),4.0 and up
9828,Égalité Et Réconciliation,NEWS_AND_MAGAZINES,4.9,58,3.8,500,Paid,2.99,Everyone,News & Magazines,Not Applicable,2018-05-26,1.1.1,5.0 and up
9819,¿Es Vegan?,FOOD_AND_DRINK,4.6,438,1.7,10000,Free,0.00,Everyone,Food & Drink,Not Applicable,2017-01-08,2.2.3,3.0 and up
5844,¡Ay Metro!,GAME,3.8,489,36.0,10000,Free,0.00,Everyone 10+,Arcade,Not Applicable,2015-03-17,1.0.3.1,4.0 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4636,/U/App,COMMUNICATION,4.7,573,53.0,10000,Free,0.00,Mature 17+,Communication,Not Applicable,2018-03-07,4.2.4,4.1 and up
4541,.R,TOOLS,4.5,259,0.2,10000,Free,0.00,Everyone,Tools,Not Applicable,2014-09-16,1.1.06,1.5 and up
8532,+Download 4 Instagram Twitter,SOCIAL,4.5,40467,22.0,1000000,Free,0.00,Everyone,Social,Not Applicable,2018-02-08,5.03,4.1 and up
324,#Name?,COMICS,3.5,115,9.1,10000,Free,0.00,Mature 17+,Comics,Not Applicable,2018-07-13,5.0.12,5.0 and up


In [83]:
# Checking information of apps dataset

apps.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9637 entries, 8133 to 8884
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   App             9637 non-null   object        
 1   Category        9637 non-null   object        
 2   Rating          9637 non-null   float64       
 3   Reviews         9637 non-null   int32         
 4   Size            9637 non-null   float32       
 5   Installs        9637 non-null   int32         
 6   Type            9637 non-null   object        
 7   Price           9637 non-null   float32       
 8   Content Rating  9637 non-null   object        
 9   Genre_1         9637 non-null   object        
 10  Genre_2         9637 non-null   object        
 11  Last Updated    9637 non-null   datetime64[ns]
 12  Current Ver     9629 non-null   object        
 13  Android Ver     9635 non-null   object        
dtypes: datetime64[ns](1), float32(2), float64(1), int32(2

**Last Updated Column**

In [84]:
# Looking for null values

apps['Last Updated'].isnull().sum()

0

In [85]:
# Looking at datatype

apps['Last Updated'].dtypes

dtype('<M8[ns]')

**Current Ver Column**

In [86]:
# Looking for null values

apps['Current Ver'].isnull().sum()

8

In [87]:
# Replacing the null values with "Unknown" as its is a categorical column.

apps['Current Ver'].fillna("Unknown", inplace= True)

In [88]:
# Checking null values

apps['Current Ver'].isnull().sum()

0

In [89]:
# Looking at unique values

apps['Current Ver'].unique()

array(['1.3', '5.4 (30)', '1.1.1', ..., '1.1.06', '5.03', '0.22'],
      dtype=object)

In [90]:
# Looking at the count of sub-type of Current Ver column

apps['Current Ver'].value_counts()

Varies with device    1052
1                      822
1.1                    270
1.2                    183
2                      163
                      ... 
1.3.6.4                  1
1.29.15                  1
04.00.40                 1
1.9.22                   1
9.8.000000010492         1
Name: Current Ver, Length: 2771, dtype: int64

**Android Ver Column**

In [91]:
# Looking for null values

apps['Android Ver'].isnull().sum()

2

In [92]:
# Replacing the null values with "Unknown" as its is a categorical column.

apps['Android Ver'].fillna("Unknown", inplace= True)

In [93]:
# Checking for null values

apps['Android Ver'].isnull().sum()

0

In [94]:
# Looking at unique values

apps['Android Ver'].unique()

array(['4.0 and up', '5.0 and up', '3.0 and up', 'Unknown', '7.0 and up',
       '4.0.3 and up', '4.2 and up', '4.3 and up', '4.1 and up',
       'Varies with device', '3.2 and up', '2.1 and up', '2.3 and up',
       '2.3.3 and up', '4.4 and up', '7.0 - 7.1.1', '2.2 and up',
       '6.0 and up', '5.1 and up', '1.6 and up', '4.4W and up',
       '1.5 and up', '4.0.3 - 7.1.1', '2.0 and up', '2.0.1 and up',
       '8.0 and up', '1.0 and up', '3.1 and up', '5.0 - 8.0',
       '7.1 and up', '5.0 - 6.0', '4.1 - 7.1.1', '5.0 - 7.1.1',
       '2.2 - 7.1.1'], dtype=object)

In [95]:
# earlier (10841,13)

# Apps dataset new dimension
apps.shape

(9637, 14)

___
### Sanctity Checks

###### Few checks/conditions that need to be fulfilled:

- Rating for the Apps should be less than 5.
- Reviews should not be greater than Installs for any app.
- Free apps should not be priced/charged. 

**1) Rating for the apps should be less than 5**

In [96]:
# Looking for records where rating for an app is greater than 5

apps['Rating']>5

8133    False
3859    False
9828    False
9819    False
5844    False
        ...  
4636    False
4541    False
8532    False
324     False
8884    False
Name: Rating, Length: 9637, dtype: bool

In [97]:
# Looking at number of records where an app has a rating > 5

apps[apps['Rating']>5].shape

(0, 14)

- No app has rating above 5.

In [98]:
apps['Rating'].describe()

count    9637.000000
mean        3.542908
std         1.574802
min         0.000000
25%         3.600000
50%         4.200000
75%         4.500000
max         5.000000
Name: Rating, dtype: float64

##### Observation:

- All the Ratings are well within the defined limit i.e. less than 5.

**2) Reviews should not be greater than Installs for any app**

In [99]:
# looking for records where Reviews for an app are greater than Installs

apps[apps['Reviews']>apps['Installs']]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genre_1,Genre_2,Last Updated,Current Ver,Android Ver
7402,Trovami Se Ci Riesci,GAME,5.0,11,6.1,10,Free,0.0,Everyone,Arcade,Not Applicable,2017-11-03,0.1,2.3 and up
6508,Sam.Bn Pro,TOOLS,0.0,11,2.0,10,Paid,0.99,Everyone,Tools,Not Applicable,2015-03-27,1.0.0,4.0.3 and up
4550,Rmedus - ????? ??? R ????? ?? ???,FAMILY,0.0,4,64.0,1,Free,0.0,Everyone,Education,Not Applicable,2018-07-17,1.0.1,4.4 and up
5917,Ra Ga Ba,GAME,5.0,2,20.0,1,Paid,1.49,Everyone,Arcade,Not Applicable,2017-08-02,1.0.4,2.3 and up
10697,Mu.F.O.,GAME,5.0,2,16.0,1,Paid,0.99,Everyone,Arcade,Not Applicable,2017-03-03,1,2.3 and up
2454,Kba-Ez Health Guide,MEDICAL,5.0,4,25.0,1,Free,0.0,Everyone,Medical,Not Applicable,2018-02-08,1.0.72,4.0.3 and up
9096,Dz Puzzle,FAMILY,0.0,14,47.0,10,Paid,0.99,Everyone,Puzzle,Not Applicable,2017-04-22,1.2,2.3 and up
8591,Dn Blog,SOCIAL,5.0,20,4.2,10,Free,0.0,Teen,Social,Not Applicable,2018-07-23,1,4.0 and up
6700,Brick Breaker Br,GAME,5.0,7,19.0,5,Free,0.0,Everyone,Arcade,Not Applicable,2018-07-23,1,4.1 and up
5812,Ax Watch For Watchmaker,PERSONALIZATION,0.0,2,0.23,1,Paid,0.99,Everyone,Personalization,Not Applicable,2017-08-18,1,2.3 and up


In [100]:
# no. of records
apps[apps['Reviews']>apps['Installs']].shape

(11, 14)

##### Observation:

- There are 11 records where Reviews are greater than Installs which is `not an ideal case`, hence we'll be storing only those records where `Reviews are less than or equal to Installs`.   

In [101]:
# Storing only those records where Reviews for an app are less than or equal to (<=) Installs

apps = apps[apps['Reviews']<= apps['Installs']]

In [102]:
# Cross- checking for apps where Reviews are greater than Installs

apps[apps['Reviews']>apps['Installs']].shape

(0, 14)

**3) Free apps should not be priced**

In [103]:
# Looking for records where Free apps that are Priced/Charged

apps[(apps['Type']=='Free') & (apps['Price']>0)]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genre_1,Genre_2,Last Updated,Current Ver,Android Ver


In [104]:
# Looking at number of records where ree apps that are Priced/Charged

apps[(apps['Type']=='Free') & (apps['Price']>0)].shape

(0, 14)

###### Observation:

- No Free App has been charged, thus fulfilling our condition.

- `All the Sanctity checks has been well performed`.

___

### Final checks:

In [105]:
# Looking at information of apps dataset

apps.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9626 entries, 8133 to 8884
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   App             9626 non-null   object        
 1   Category        9626 non-null   object        
 2   Rating          9626 non-null   float64       
 3   Reviews         9626 non-null   int32         
 4   Size            9626 non-null   float32       
 5   Installs        9626 non-null   int32         
 6   Type            9626 non-null   object        
 7   Price           9626 non-null   float32       
 8   Content Rating  9626 non-null   object        
 9   Genre_1         9626 non-null   object        
 10  Genre_2         9626 non-null   object        
 11  Last Updated    9626 non-null   datetime64[ns]
 12  Current Ver     9626 non-null   object        
 13  Android Ver     9626 non-null   object        
dtypes: datetime64[ns](1), float32(2), float64(1), int32(2

In [106]:
# Looking at the number of unique values for each column in apps dataset

apps.nunique()

App               9626
Category            33
Rating              40
Reviews           5329
Size               274
Installs            20
Type                 2
Price               90
Content Rating       6
Genre_1             48
Genre_2              7
Last Updated      1377
Current Ver       2770
Android Ver         34
dtype: int64

___

## Reviews Dataset

In [107]:
# Looking at top 5 rows

reviews.head()

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


In [108]:
# Looking at information of reviews dataset

reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     64295 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37432 non-null  object 
 3   Sentiment_Polarity      37432 non-null  float64
 4   Sentiment_Subjectivity  37432 non-null  float64
dtypes: float64(2), object(3)
memory usage: 2.5+ MB


In [109]:
# Looking at dimension of the dataset

reviews.shape

(64295, 5)

In [110]:
# looking for duplicate vaules

reviews.duplicated().sum()

33616

- Not removing duplicates as a single app can have multiple reviews thus multiple entry of the same app.

In [111]:
# looking for null values

reviews.isnull().sum()

App                           0
Translated_Review         26868
Sentiment                 26863
Sentiment_Polarity        26863
Sentiment_Subjectivity    26863
dtype: int64

In [112]:
# Looking at number of unique values present in each column of reviews dataset 

reviews.nunique()

App                        1074
Translated_Review         27994
Sentiment                     3
Sentiment_Polarity         5410
Sentiment_Subjectivity     4474
dtype: int64

**App Column**

In [113]:
# Looking at unique values

reviews['App'].unique()

array(['10 Best Foods for You', '104 找工作 - 找工作 找打工 找兼職 履歷健檢 履歷診療室',
       '11st', ..., 'Hotwire Hotel & Car Rental App',
       'Housing-Real Estate & Property', 'Houzz Interior Design Ideas'],
      dtype=object)

In [114]:
# Lookig at counts of sub-type of App column

reviews['App'].value_counts()

Bowmasters                                           320
Angry Birds Classic                                  320
CBS Sports App - Scores, News, Stats & Watch Live    320
Helix Jump                                           300
8 Ball Pool                                          300
                                                    ... 
Easy Healthy Recipes                                  31
Dresses Ideas & Fashions +3000                        31
Detector de Radares Gratis                            31
Drawing Clothes Fashion Ideas                         30
Easy Hair Style Design                                30
Name: App, Length: 1074, dtype: int64

**Translated_Review column**

In [115]:
# Removing all the rows with null values as they would not help in analysis

reviews = reviews[~reviews['Translated_Review'].isnull()]

In [116]:
# Cross - checking for null values

reviews['Translated_Review'].isnull().sum()

0

In [117]:
# Lookig at unique values

reviews['Translated_Review'].unique()

array(['I like eat delicious food. That\'s I\'m cooking food myself, case "10 Best Foods" helps lot, also "Best Before (Shelf Life)"',
       'This help eating healthy exercise regular basis',
       'Works great especially going grocery store', ...,
       'Dumb app, I wanted post property rent give option. Website work. Waste time space phone.',
       'I property business got link SMS happy performance still guys need raise bar guys Cheers',
       'Useless app, I searched flats kondapur, Hyderabad . None number reachable I know flats unavailable would keep posts active'],
      dtype=object)

**Sentiment column**

In [118]:
# Looking for null values

reviews['Sentiment'].isnull().sum()

0

In [119]:
# Looking at unique values

reviews['Sentiment'].unique()

array(['Positive', 'Neutral', 'Negative'], dtype=object)

In [120]:
# Looking at counts of sub-type of Sentiment column

reviews['Sentiment'].value_counts()

Positive    23998
Negative     8271
Neutral      5158
Name: Sentiment, dtype: int64

**Sentiment_Polarity column**

In [121]:
# looking for null values

reviews['Sentiment_Polarity'].isnull().sum()

0

In [122]:
# Looking for unique values

reviews['Sentiment_Polarity'].unique()

array([ 1.        ,  0.25      ,  0.4       , ..., -0.52857143,
       -0.37777778,  0.17333333])

In [123]:
# Rounding the values to 3 decimal digits

reviews['Sentiment_Polarity'] = np.round(reviews['Sentiment_Polarity'], 3)

In [124]:
# Cross-checking the unique values

reviews['Sentiment_Polarity'].unique()

array([ 1.   ,  0.25 ,  0.4  , ...,  0.593, -0.988, -0.529])

In [125]:
# Looking at the description of Sentiment_Polarity column

reviews['Sentiment_Polarity'].describe()

count    37427.000000
mean         0.182166
std          0.351319
min         -1.000000
25%          0.000000
50%          0.150000
75%          0.400000
max          1.000000
Name: Sentiment_Polarity, dtype: float64

**Sentiment_Subjectivity column**

In [126]:
# looking for null values

reviews['Sentiment_Subjectivity'].isnull().sum()

0

In [127]:
# Looking for unique values

reviews['Sentiment_Subjectivity'].unique()

array([0.53333333, 0.28846154, 0.875     , ..., 0.51145833, 0.7172619 ,
       0.2594697 ])

In [128]:
# Rounding the values to 3 decimal digits

reviews['Sentiment_Subjectivity'] = np.round(reviews['Sentiment_Subjectivity'], 3)

In [129]:
# Cross-checking the unique values 
reviews['Sentiment_Subjectivity'].unique()

array([0.533, 0.288, 0.875, 0.3  , 0.9  , 0.   , 0.6  , 0.1  , 0.867,
       0.511, 1.   , 0.667, 0.8  , 0.35 , 0.5  , 0.69 , 0.2  , 0.75 ,
       0.675, 0.356, 0.55 , 0.15 , 0.412, 0.525, 0.65 , 0.833, 0.467,
       0.367, 0.597, 0.45 , 0.735, 0.475, 0.4  , 0.683, 0.286, 0.622,
       0.78 , 0.717, 0.25 , 0.275, 0.633, 0.587, 0.143, 0.767, 0.739,
       0.302, 0.364, 0.05 , 0.455, 0.375, 0.478, 0.458, 0.536, 0.333,
       0.562, 0.324, 0.417, 0.775, 0.917, 0.542, 0.357, 0.792, 0.067,
       0.644, 0.7  , 0.517, 0.706, 0.648, 0.579, 0.479, 0.594, 0.378,
       0.705, 0.506, 0.575, 0.52 , 0.975, 0.572, 0.192, 0.697, 0.688,
       0.558, 0.85 , 0.513, 0.625, 0.401, 0.436, 0.42 , 0.581, 0.322,
       0.469, 0.425, 0.41 , 0.544, 0.368, 0.599, 0.607, 0.749, 0.555,
       0.564, 0.326, 0.418, 0.585, 0.653, 0.521, 0.651, 0.92 , 0.771,
       0.578, 0.737, 0.708, 0.637, 0.258, 0.583, 0.397, 0.718, 0.371,
       0.48 , 0.568, 0.249, 0.758, 0.61 , 0.483, 0.502, 0.733, 0.489,
       0.889, 0.54 ,

In [130]:
# Looking at description of Sentiment_Subjectivity column

reviews['Sentiment_Subjectivity'].describe()

count    37427.000000
mean         0.492770
std          0.259905
min          0.000000
25%          0.357000
50%          0.514000
75%          0.650000
max          1.000000
Name: Sentiment_Subjectivity, dtype: float64

**After cleaning both the files**

In [131]:
# new dimension of apps dataset

apps.shape

(9626, 14)

- older dimension (10841,13)

In [132]:
# new dimension of reviews dataset

reviews.shape

(37427, 5)

- older dimension (64295,5)

**converting new cleaned files to csv**

In [133]:
# Converting apps to cleaned_playstore_apps.csv

apps.to_csv('cleaned_playstore_apps.csv', header=True, index=False)

In [134]:
# Converting reviewss to cleaned_playstore_reviews.csv

reviews.to_csv('cleaned_playstore_reviews.csv', header=True, index=False)