# Google Play Market Analysis

In this notebook I analyze the [Google Play Store apps data set from Kaggle](https://www.kaggle.com/gauthamp10/google-playstore-apps?select=Google-Playstore-32K.csv). It contains data up to April 2019. There are two versions: a full version with data on 267K apps and a minimal version with data on 32K apps. Here I use the full version.

This is a course project for the [Zero to Pandas course](http://zerotopandas.com).

In [None]:
import os

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline


## Data Preparation and Cleaning

### Import the data from the file and take a sample to see how it looks:

In [None]:
data_df = pd.read_csv('../input/google-playstore-apps/Google-Playstore-Full.csv')

In [None]:
data_df.sample(10)

In [None]:
data_df.shape

### Let's look at the column names and their types:

In [None]:
data_df.columns

In [None]:
data_df.info()

### It looks like some rows are shifted across columns, and almost all columns have dtype `object`. Let's make a copy of the data and start making it more suitable for analysis.

In [None]:
PMS_df = data_df.copy()

### Category contains strange values. Let's move the data into the correct columns and replace the leftover values with NaN.

In [None]:
PMS_df.Category.value_counts().tail(20)

In [None]:
# shifted
strange_data = [' ETEA & MDCAT', ' not notified you follow -', '6', ' Speaker Pro 2019', ' Alfabe �?ren', ' Mexpost)', ' Podcasts', ' Accounting', ' Islamic Name Boy & Girl+Meaning']
# positions of the rows whose Category holds one of the strange values
index = np.flatnonzero(PMS_df.Category.isin(strange_data))
PMS_df.iloc[index]

In [None]:
# replaced: (target start column, source start column, number of cells) for each row in index
shifts = [(1, 4, 9), (0, 4, 10)] + [(1, 2, 9)] * 6 + [(1, 3, 9)]
for row, (dst, src, n) in zip(index, shifts):
    PMS_df.iloc[row, dst:dst + n] = list(PMS_df.iloc[row, src:src + n])
    PMS_df.iloc[row, 11:] = np.nan

PMS_df.iloc[index]
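
The manual fix above can be illustrated on a toy frame. This is only a minimal sketch: the column names, the sample rows and the helper `shift_row_left` are hypothetical and are not part of the data set. Unlike the cell above, the helper blanks everything to the right of the copied cells.

```python
import numpy as np
import pandas as pd

def shift_row_left(df, row, dst, src, n):
    """Copy n cells of one row from position src to position dst, then blank the tail."""
    df.iloc[row, dst:dst + n] = list(df.iloc[row, src:src + n])
    df.iloc[row, dst + n:] = np.nan
    return df

# Hypothetical toy data: the second row has an extra cell that pushed its values to the right.
toy = pd.DataFrame(
    [["App A", "TOOLS", "4.5", "100", None],
     ["App B", " extra name part", "GAME_PUZZLE", "4.1", "250"]],
    columns=["App Name", "Category", "Rating", "Reviews", "Unnamed: 4"],
)
toy = shift_row_left(toy, row=1, dst=1, src=2, n=3)
print(toy)
```

After the call, the second row has its Category, Rating and Reviews in the right columns and NaN in the overflow column.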

In [None]:
# shifted
strange_data = [' Channel 2 News', 'Gate ALARM', ' T�rk Alfabesi', ' super loud speaker booster', ' Tour Guide', ' Romantic Song Music Love Songs', ' Breaking News', ')', 'TRAVEL']
# positions of the rows whose Category holds one of the strange values
index = np.flatnonzero(PMS_df.Category.isin(strange_data))
PMS_df.iloc[index]

In [None]:
# replaced: shift these rows one column to the left (the row at index[7] is not modified)
for k in (0, 1, 2, 3, 4, 5, 6, 8):
    PMS_df.iloc[index[k], 1:10] = list(PMS_df.iloc[index[k], 2:11])
    PMS_df.iloc[index[k], 11:] = np.nan

PMS_df.iloc[index]

### One more check for shifted data:

In [None]:
print(PMS_df['Unnamed: 11'].unique())
print(PMS_df['Unnamed: 12'].unique())
print(PMS_df['Unnamed: 13'].unique())
print(PMS_df['Unnamed: 14'].unique())
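
Instead of reading the unique values by eye, the same check can be expressed as a filter: any row that still has a value in one of the overflow columns is a candidate for a shifted row. A minimal sketch on made-up data (the toy frame and the names `overflow_cols` and `suspect_positions` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two overflow columns.
toy = pd.DataFrame({
    "App Name": ["App A", "App B", "App C"],
    "Latest Version": ["1.0", "2.3", "Varies with device"],
    "Unnamed: 11": [np.nan, "4.0.0.0", np.nan],
    "Unnamed: 12": [np.nan, np.nan, np.nan],
})

# Overflow columns are the ones pandas auto-named "Unnamed: N".
overflow_cols = [c for c in toy.columns if c.startswith("Unnamed")]
# A row is suspicious when any overflow column holds a value.
suspect_positions = np.flatnonzero(toy[overflow_cols].notna().any(axis=1))
print(toy.iloc[suspect_positions])
```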

In [None]:
# shifted
strange_data = ['4.0.0.0']
# positions of the rows whose 'Unnamed: 11' holds the strange value
index = np.flatnonzero(PMS_df['Unnamed: 11'].isin(strange_data))
PMS_df.iloc[index]

In [None]:
# replaced
PMS_df.iloc[index[0],1:10] = list(PMS_df.iloc[index[0], 2:11])
PMS_df.iloc[index[0], 11:] = np.nan

PMS_df.iloc[index]

### Rating and Reviews look fine, so let's convert them to numbers.

In [None]:
# Rating
PMS_df.Rating.value_counts()

In [None]:
PMS_df['Rating'] = pd.to_numeric(PMS_df.Rating, errors='coerce')

In [None]:
# Reviews
PMS_df.Reviews.value_counts()

In [None]:
PMS_df['Reviews'] = pd.to_numeric(PMS_df.Reviews, errors='coerce')
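
Because `errors='coerce'` silently turns every unparseable value into NaN, it can be worth checking what was lost in the conversion. A minimal sketch on made-up values (the series `raw` is hypothetical, not taken from the data set):

```python
import pandas as pd

# Hypothetical raw column: two clean numbers, one missing value, two unparseable strings.
raw = pd.Series(["4.5", "3.9", None, "Everyone", "1,024"])
converted = pd.to_numeric(raw, errors="coerce")

# Values that were present before the conversion but are NaN afterwards.
lost = raw[converted.isna() & raw.notna()]
print(lost)
```

If `lost` is empty on the real column, the conversion did not discard anything.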

### Next, the Installs column needs to be cleaned and converted.

In [None]:
PMS_df.Installs.value_counts()

In [None]:
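# strip '+' and ',' as literal characters (regex=False) before converting
PMS_df.Installs = PMS_df.Installs.str.replace('+', '', regex=False)
PMS_df.Installs = PMS_df.Installs.str.replace(',', '', regex=False)
PMS_df['Installs'] = pd.to_numeric(PMS_df.Installs, errors='coerce')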
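
To see what this cleaning does, here is the same idea on a few made-up values (the series `installs` is hypothetical). Passing `regex=False` makes `'+'` a literal character instead of a regular-expression quantifier:

```python
import pandas as pd

# Hypothetical sample of the Installs format, plus one value that is not a count.
installs = pd.Series(["1,000+", "500,000+", "10+", "Free"])

cleaned = (
    installs.str.replace("+", "", regex=False)
            .str.replace(",", "", regex=False)
)
# Anything that is still not a number becomes NaN.
numeric = pd.to_numeric(cleaned, errors="coerce")
print(numeric)
```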

### The Size column looks fine, so let's leave it as is.

In [None]:
PMS_df.Size.value_counts()

### Next, clean the Price column and add a Distribution model column that stores 'Free' if the app price is zero and 'Paid' otherwise.

In [None]:
PMS_df.Price.value_counts()

In [None]:
# '$' is a regex anchor, so strip it as a literal character
PMS_df.Price = PMS_df.Price.str.replace('$', '', regex=False)
PMS_df['Price'] = pd.to_numeric(PMS_df.Price, errors='coerce')

In [None]:
# 'Free' when the price is exactly zero; everything else, including missing prices, is 'Paid'
PMS_df['Distribution model'] = np.where(PMS_df['Price'] == 0, 'Free', 'Paid')

### Content Rating looks fine.

In [None]:
PMS_df['Content Rating'].value_counts()

### Now let's drop the Last Updated, Minimum Version, Latest Version and Unnamed columns from the DataFrame.

In [None]:
PMS_df = PMS_df.drop(['Last Updated', 'Minimum Version',
       'Latest Version', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13',
       'Unnamed: 14'],axis=1)

## Exploratory Analysis and Visualization

In [None]:
PMS_df

In [None]:
PMS_df.info()

In [None]:
PMS_df.describe()

### Now let's visualize this data set with plots.

### Category distribution:

In [None]:
plt.figure(figsize=(20,5))
plt.title('Category distribution')
PMS_df.Category.value_counts().plot(kind='bar',)


### There are many different game categories, so let's make separate plots for games and other apps.

In [None]:
# na=False treats a missing Category as "not a game", so the boolean mask never contains NaN
is_game = PMS_df.Category.str.contains('GAME', regex=False, na=False)
games = PMS_df[is_game]
other = PMS_df[~is_game]
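
The split relies on a substring match, so it is sensitive to missing values in Category. A minimal sketch on made-up categories (the series `category` is hypothetical) showing how `na=False` keeps the mask purely boolean:

```python
import numpy as np
import pandas as pd

# Hypothetical categories, including one missing value.
category = pd.Series(["GAME_PUZZLE", "TOOLS", np.nan, "GAME_ACTION"])

# na=False maps a missing category to False instead of NaN.
is_game = category.str.contains("GAME", regex=False, na=False)
games, other = category[is_game], category[~is_game]
print(games.tolist(), other.tolist())
```

With this choice, rows without a category land in the "other" group.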

In [None]:
plt.figure(figsize=(20,5))
plt.title('Games category distribution')
games.Category.value_counts().plot(kind='bar',)

In [None]:
plt.figure(figsize=(20,5))
plt.title('Other apps category distribution')
other.Category.value_counts().plot(kind='bar',)

### Rating distribution for apps with more than 1000 reviews.

In [None]:
plt.figure(figsize=(10,10))
plt.title('Ratings distribution')
# histplot replaces the deprecated distplot(..., kde=False)
sns.histplot(PMS_df.Rating[PMS_df.Reviews > 1000])

### Content Rating distribution:

In [None]:
plt.figure(figsize=(8,8))
plt.title('Content Rating distribution')
PMS_df['Content Rating'].value_counts().plot(kind='bar')

## Asking and Answering Questions


### Let's see how the number of Reviews relates to the Rating. For a clearer view, let's split the apps into those with at most 1 million Reviews and those with more.

In [None]:
rev = 1000000

rat1 = PMS_df.Rating[PMS_df.Reviews <= rev]
rev1 = PMS_df.Reviews[PMS_df.Reviews <= rev]

rat2 = PMS_df.Rating[PMS_df.Reviews > rev]
rev2 = PMS_df.Reviews[PMS_df.Reviews > rev]

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(30,8))
fig.suptitle('Effect of Reviews on Rating')

ax1.scatter(rev1,rat1)
ax1.set_xlabel('Reviews')
ax1.set_ylabel('Rating')

ax2.scatter(rev2,rat2)
ax2.set_xlabel('Reviews')
ax2.set_ylabel('Rating')

plt.show()

Overall, the number of Reviews does not appear to affect the Rating.
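
A scatter plot only gives a visual impression, so a rank correlation can put a number on it. The sketch below uses made-up values (the toy frame is hypothetical) and computes the Spearman correlation as the Pearson correlation of the ranks, which copes better with the heavy skew in Reviews:

```python
import pandas as pd

# Hypothetical toy data spanning several orders of magnitude in Reviews.
toy = pd.DataFrame({
    "Reviews": [10, 200, 3_000, 45_000, 600_000, 7_000_000],
    "Rating": [4.1, 3.8, 4.4, 4.0, 4.3, 4.2],
})

# Spearman correlation = Pearson correlation of the ranks.
rho = toy["Reviews"].rank().corr(toy["Rating"].rank())
print(round(rho, 3))
```

A value near zero on the real data would support the conclusion above, while a value near 1 or -1 would contradict it.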

### What is the mean Rating per Category?

In [None]:
plt.figure(figsize=(20,8))
plt.title('Mean Rating per Category')
plt.grid()
plt.xlabel('Category')
plt.xticks(rotation=90)
plt.ylabel('Rating')

d = PMS_df.groupby('Category')['Rating'].mean().reset_index()
plt.scatter(d.Category, d.Rating)

Books and Reference apps have the highest mean rating and Travel apps have the lowest.
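
Rather than reading the extremes off the plot, they can be pulled out of the grouped result directly. A minimal sketch on made-up ratings (the toy frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with three categories.
toy = pd.DataFrame({
    "Category": ["BOOKS_AND_REFERENCE", "BOOKS_AND_REFERENCE",
                 "TRAVEL_AND_LOCAL", "TRAVEL_AND_LOCAL", "TOOLS"],
    "Rating": [4.6, 4.4, 3.9, 4.1, 4.2],
})

# Mean rating per category, highest first.
mean_rating = toy.groupby("Category")["Rating"].mean().sort_values(ascending=False)
best, worst = mean_rating.idxmax(), mean_rating.idxmin()
print(best, worst)
```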

### What is the most common Distribution model on Google Play?

In [None]:
Dist_method = PMS_df['Distribution model'].value_counts()
plt.figure(figsize=(10,10))
plt.title('Distribution model')
plt.pie(Dist_method, labels=Dist_method.index, autopct='%1.1f%%', startangle=180);

So 95.7% of the apps are free to download.
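
The share shown on the pie chart can also be computed as a number with `value_counts(normalize=True)`. A minimal sketch on made-up labels (the series `model` is hypothetical, so the percentages differ from the real ones):

```python
import pandas as pd

# Hypothetical labels: three free apps and one paid app.
model = pd.Series(["Free", "Free", "Free", "Paid"])

# Relative frequencies, expressed as percentages.
share = model.value_counts(normalize=True).mul(100).round(1)
print(share)
```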

### What is the most common price of paid apps?

In [None]:
plt.figure(figsize=(20,5))
price = PMS_df.Price[PMS_df.Price > 0].value_counts()
(price.head(50)).plot(kind = 'bar')

### What is the most common app Size?

In [None]:
plt.figure(figsize=(20,5))
size = PMS_df.Size.value_counts(normalize = False)

(size.head(100)).plot(kind = 'bar')

So the most common Size value is 'Varies with device'.
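
Since Size was left as text, the counts above compare strings rather than numbers. If a numeric size is needed, something like the following could work. It is only a sketch: the helper `size_to_mb` is hypothetical, and it assumes sizes are written with a `k`, `M` or `G` suffix, which I have not verified for every row of the data set.

```python
import numpy as np
import pandas as pd

def size_to_mb(value):
    """Convert strings such as '5.2M' or '512k' to megabytes; anything else becomes NaN."""
    if not isinstance(value, str):
        return np.nan
    text = value.strip().replace(",", "")
    # Assumed suffixes and their factors relative to one megabyte.
    factors = {"k": 1 / 1024, "M": 1.0, "G": 1024.0}
    if text and text[-1] in factors:
        try:
            return float(text[:-1]) * factors[text[-1]]
        except ValueError:
            return np.nan
    return np.nan

# Hypothetical sample values, including a non-numeric one and a missing one.
sizes = pd.Series(["5.2M", "512k", "Varies with device", None])
size_mb = sizes.map(size_to_mb)
print(size_mb)
```

Values such as 'Varies with device' become NaN, so they drop out of any numeric summary.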