## Import Libraries



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np 
import warnings
warnings.filterwarnings('ignore')

## Importing Dataset

In [None]:
df=pd.read_csv('../input/playstore-analysis-eda/playstore-analysis.csv')

In [None]:
df.head()

## 1. Data clean up – Missing value treatment

### a. Drop records where rating is missing since rating is our target/study variable

In [None]:
df.info()

In [None]:
df.dropna(axis="rows",how="any",subset=["Rating"],inplace=True)

In [None]:
df.info()

### b. Check the null values for the Android Ver column.

In [None]:
df["Android Ver"].isna().sum()

In [None]:
df[df.isnull().any(axis="columns")]

### i. Are all 3 records having the same problem?

Yes, all 3 records have the same problem, i.e. the value is NaN in each of them.

### ii. Drop the 3rd record i.e. record for “Life Made WIFI …”

In [None]:
df.drop([10472],inplace=True)

### iii. Replace remaining missing values with the mode

In [None]:
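# Assign the result back; fillna(..., inplace=True) on a column selection is unreliable in recent pandas
df["Current Ver"]=df["Current Ver"].fillna(df["Current Ver"].mode()[0])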

In [None]:
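# Assign the result back; fillna(..., inplace=True) on a column selection is unreliable in recent pandas
df["Android Ver"]=df["Android Ver"].fillna(df["Android Ver"].mode()[0])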

In [None]:
df.info()

As we can see, there are no null values left in our data frame, so we can move on to further analysis.
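The mode-based fill can be checked on a small example. This is a minimal sketch on a made-up frame (`toy` and its values are assumptions, not the Play Store data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Play Store data
toy = pd.DataFrame(
    {"Android Ver": ["4.1 and up", np.nan, "4.1 and up", "5.0 and up"]}
)

# mode() returns a Series because ties are possible, so take the first entry
most_common = toy["Android Ver"].mode()[0]

# Assign the filled column back to the frame
toy["Android Ver"] = toy["Android Ver"].fillna(most_common)

print(toy["Android Ver"].isna().sum())
```

Assigning the filled column back keeps the change visible on the frame itself, whichever pandas version is installed.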

## 2. Data clean up – correcting the data types

### a. Which all variables need to be brought to numeric types?

In [None]:
df["Reviews"]=df["Reviews"].astype("int")

In [None]:
df.info()

The Reviews column data type is changed to int.

### b. Price variable – remove $ sign and convert to float

In [None]:
# regex=False treats '$' as a literal character, not as the end-of-string anchor
df["Price"]=df["Price"].str.replace('$',"",regex=False)

The $ sign is removed from the Price column.

In [None]:
df["Price"]=df["Price"].astype("float")

The Price column is converted to float.
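To see why the literal match matters, here is a small sketch on made-up price strings (the values are assumptions). In a regular expression `$` means "end of string", so we pass `regex=False` to remove the character itself:

```python
import pandas as pd

# Made-up price strings in the same shape as the Price column
prices = pd.Series(["0", "$4.99", "$399.99"])

# Strip the literal dollar sign, then convert to float
cleaned = prices.str.replace("$", "", regex=False).astype(float)

print(cleaned.tolist())
```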

### c. Installs – remove ‘,’ and ‘+’ sign, convert to integer

In [None]:
df["Installs"]=df["Installs"].str.replace(',',"")

In [None]:
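# regex=False treats '+' as a literal character; as a regex pattern it is invalid on its own
df["Installs"]=df["Installs"].str.replace('+',"",regex=False)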

In [None]:
df.head()

The "," and "+" signs are removed from the Installs column.

In [None]:
df["Installs"]=df["Installs"].astype("int")

In [None]:
df.info()
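As an alternative sketch, both characters can be removed in one pass with a regular-expression character class. The sample strings are made up for illustration:

```python
import pandas as pd

# Made-up install counts in the same shape as the Installs column
installs = pd.Series(["10,000+", "500+", "1,000,000+"])

# Inside [...] both ',' and '+' are literal characters
as_int = installs.str.replace(r"[,+]", "", regex=True).astype(int)

print(as_int.tolist())
```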

### d. Convert all other identified columns to numeric

In [None]:
df["Size"]=df["Size"].astype("int")

## 3. Sanity checks – check for the following and handle accordingly

### a. Avg. rating should be between 1 and 5, as only these values are allowed on the play store.

In [None]:
# Select records whose rating falls outside the allowed range of 1 to 5
df[(df["Rating"]<1) | (df["Rating"]>5)]

There are no records with a rating less than 1 or greater than 5, so nothing needs to be dropped.
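A minimal sketch of the range check on made-up ratings (the values are assumptions). Selecting the rows outside the range means an empty result is the evidence that every rating is valid:

```python
import pandas as pd

# Made-up ratings, two of them deliberately out of range
ratings = pd.DataFrame({"Rating": [4.1, 5.0, 1.0, 19.0, 0.5]})

# between() is inclusive on both ends by default; ~ inverts the mask
out_of_range = ratings[~ratings["Rating"].between(1, 5)]

print(out_of_range)
```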

### b. Reviews should not be more than installs as only those who installed can review the app.

#### i. Are there any such records? Drop if so.

In [None]:
df["New"]=np.where(df["Reviews"] > df["Installs"],"True","False")

In [None]:
drop_indexes=df[df["New"]=="True"].index

In [None]:
df.drop(drop_indexes,inplace=True)

In [None]:
df

We dropped the rows where the number of reviews is greater than the number of installs. Only users who installed an app can review it, so those reviews are presumably fake.
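The same filter can also be written with a boolean mask and no helper column. This is a sketch on a made-up frame (`apps` and its values are assumptions):

```python
import pandas as pd

# Made-up records; app "B" has more reviews than installs
apps = pd.DataFrame(
    {
        "App": ["A", "B", "C"],
        "Reviews": [10, 500, 7],
        "Installs": [100, 50, 7],
    }
)

# Boolean mask of the inconsistent records
suspicious = apps["Reviews"] > apps["Installs"]

# Keep only the rows where the mask is False
kept = apps[~suspicious]

print(kept)
```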

## 4. Identify and handle outliers –

### a. Price column
      
#### i. Make suitable plot to identify outliers in price

In [None]:
fig,ax=plt.subplots()
plt.boxplot(df["Price"])
plt.show()

#### ii. Do you expect apps on the play store to cost $200? Check out these cases

In [None]:
df[df["Price"]>200]

Yes, we can expect apps priced at $200 or more on the Play Store.

#### iii. Limit data to records with price < $30

In [None]:
df=df[df["Price"]<30]

#### iv. After dropping the useless records, make the suitable plot again to identify outliers

In [None]:
fig,ax=plt.subplots()
plt.boxplot(df["Price"])
plt.show()

### b. Reviews column

#### i. Make suitable plot

In [None]:
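# distplot is deprecated in seaborn; histplot with a KDE overlay is the replacement
sns.histplot(df['Reviews'],kde=True,stat="density")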
plt.show()

#### ii. Limit data to apps with < 1 Million reviews

In [None]:
df=df[df["Reviews"]<1000000]

### c. Installs

#### i. What is the 95th percentile of the installs?

In [None]:
Percentile=df.Installs.quantile(0.95)
print(Percentile)

#### ii. Drop records having a value more than the 95th percentile

In [None]:
drop1=df[df["Installs"]>Percentile].index

In [None]:
df=df.drop(drop1)
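The percentile cut can be illustrated on a simple series. This is a sketch with made-up values (1 to 100), not the real install counts:

```python
import pandas as pd

# Made-up install counts: the integers 1..100
installs = pd.Series(range(1, 101))

# quantile() interpolates linearly between neighbouring values by default
cutoff = installs.quantile(0.95)

# Keep everything at or below the cutoff
trimmed = installs[installs <= cutoff]

print(cutoff, len(trimmed))
```

Filtering with a mask gives the same rows as collecting the indexes above the cutoff and dropping them.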

# Data analysis to answer business questions

## 5. What is the distribution of ratings like? (use Seaborn) More skewed towards higher/lower values?

### a. How do you explain this?

In [None]:
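# distplot is deprecated in seaborn; histplot with a KDE overlay is the replacement
sns.histplot(df['Rating'],kde=True,stat="density")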
plt.show()
print('The skewness of this distribution is',df['Rating'].skew())
print('The Median of this distribution {} is greater than mean {} of this distribution'.format(df.Rating.median(),df.Rating.mean()))

### b. What is the implication of this on your analysis?

In [None]:
mode=df['Rating'].mode()
mean=df['Rating'].mean()
median=df['Rating'].median()
print(mode)
print(mean)
print(median)
print("As mode >= median > mean, the distribution of Ratings is negatively skewed")
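The relation between mean, median and skewness can be verified on a small example. This is a sketch on made-up ratings with a long left tail (the values are assumptions):

```python
import pandas as pd

# Made-up ratings: mostly high, with one very low value
ratings = pd.Series([1.0, 3.5, 4.0, 4.2, 4.3, 4.4, 4.5, 4.5])

mean = ratings.mean()
median = ratings.median()
skewness = ratings.skew()

# Derive the conclusion from the numbers instead of stating it unconditionally
if skewness < 0 and mean < median:
    direction = "negatively skewed"
elif skewness > 0 and mean > median:
    direction = "positively skewed"
else:
    direction = "not clearly skewed"

print(mean, median, skewness, direction)
```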

## 6. What are the top Content Rating values?

In [None]:
df["Content Rating"].value_counts()

### a. Are there any values with very few records?

Yes, Adults only 18+ and Unrated have very few records, so we drop them.

In [None]:
drop2=df[df["Content Rating"]=="Adults only 18+"].index
drop3=df[df["Content Rating"]=="Unrated"].index

In [None]:
df=df.drop(drop2)

In [None]:
df=df.drop(drop3)

## 7. Effect of size on rating

### a. Make a jointplot to understand the effect of size on rating

In [None]:
sns.jointplot(x = "Rating", y = "Size",kind = "hex", data = df)
plt.show()

### b. Do you see any patterns?

### c. How do you explain the pattern?

Yes, we see a pattern. The plot shows app ratings against app sizes, together with the distribution of each.
From this plot we can infer that apps rated between 4.0 and 4.5 tend to have a size of around 20000.

## 8. Effect of price on rating

### a. Make a jointplot (with regression line)

In [None]:
sns.jointplot(x="Price",y="Rating",data=df,kind="reg")
plt.show()

### b. What pattern do you see?

In general, as Price increases, Rating stays almost constant, above 4.

### c. How do you explain the pattern?

Since Rating stays almost constant (above 4) as Price increases, we can conclude that there is only a very weak positive correlation between Rating and Price.

### d. Replot the data, this time with only records with price > 0

In [None]:
df1=df[df["Price"]>0]

In [None]:
sns.jointplot(x="Price",y="Rating",data=df1,kind="reg")
plt.show()

### e. Does the pattern change?

Yes. When we limit the records to Price > 0, the overall pattern changes slightly, i.e. there is now a very weak negative correlation between Price and Rating.

### f. What is your overall inference on the effect of price on the rating

In general, increasing the Price has no significant effect on Rating. Even at higher prices, Rating stays high and almost constant, i.e. above 4.
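The strength of such a relationship can be put into a number with the Pearson correlation coefficient. This is a sketch on made-up prices and ratings (the values are assumptions, so the result says nothing about the real data):

```python
import pandas as pd

# Made-up prices and ratings for illustration only
toy = pd.DataFrame(
    {
        "Price": [1.0, 2.0, 3.0, 4.0],
        "Rating": [4.0, 4.1, 4.0, 4.1],
    }
)

# Series.corr computes the Pearson coefficient by default; values lie in [-1, 1]
r = toy["Price"].corr(toy["Rating"])

print(round(r, 3))
```

Running the same call on `df` and on the paid-only subset would show whether the sign really flips between the two plots.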

## 9. Look at all the numeric interactions together –

### a. Make a pairplot with the columns - 'Reviews', 'Size', 'Rating', 'Price'

In [None]:
sns.pairplot(df, vars=['Reviews', 'Size', 'Rating', 'Price'], kind='reg')
plt.show()

## 10. Rating vs. content rating

### a. Make a bar plot displaying the rating for each content rating

In [None]:
df.groupby(['Content Rating'])['Rating'].count().plot.bar(color="#19F1E9")
plt.show()

### b. Which metric would you use? Mean? Median? Some other quantile?

We should use the median here because Rating contains outliers. When outliers are present, the median is a more robust measure of central tendency than the mean.
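The effect of an outlier on the two metrics can be compared side by side. This is a sketch on made-up ratings (the groups and values are assumptions):

```python
import pandas as pd

# Made-up ratings; the "Everyone" group contains one low outlier (1.0)
toy = pd.DataFrame(
    {
        "Content Rating": ["Everyone"] * 4 + ["Teen"] * 3,
        "Rating": [4.5, 4.4, 4.6, 1.0, 4.0, 4.2, 4.1],
    }
)

# Mean and median per group in one table
summary = toy.groupby("Content Rating")["Rating"].agg(["mean", "median"])

print(summary)
```

The outlier pulls the mean of the "Everyone" group well below its median, while the "Teen" group, which has no outlier, shows the same mean and median.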

In [None]:
fig,ax=plt.subplots()
ax.boxplot(df['Rating'])
ax.set_xticklabels(["App"])
ax.set_ylabel("Ratings")
plt.show()

### c. Choose the right metric and plot

In [None]:
df.groupby(['Content Rating'])['Rating'].median().plot.barh(color="#19F1E9")
plt.show()

## 11. Content rating vs. size vs. rating – 3 variables at a time

### a. Create 5 buckets (20% records in each) based on Size

In [None]:
bins=[0, 20000, 40000, 60000, 80000, 100000]
df['Bucket Size'] = pd.cut(df['Size'], bins, labels=['0-20k','20k-40k','40k-60k','60k-80k','80k-100k'])
pd.pivot_table(df, values='Rating', index='Bucket Size', columns='Content Rating')
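The fixed-width bins above do not guarantee 20% of the records in each bucket. As a hedged alternative, `pd.qcut` splits on quantiles instead. This is a sketch on made-up sizes, and the labels `Q1` to `Q5` are assumptions:

```python
import pandas as pd

# Made-up sizes: the integers 1..100
sizes = pd.Series(range(1, 101))

# qcut chooses the bin edges so that each bucket holds an equal share of records
buckets = pd.qcut(sizes, q=5, labels=["Q1", "Q2", "Q3", "Q4", "Q5"])

counts = buckets.value_counts()
print(counts)
```

On real data with many repeated sizes the quantile edges can coincide. `pd.qcut` then raises an error unless `duplicates="drop"` is passed, in which case the number of labels must match the remaining bins.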

### b. By Content Rating vs. Size buckets, get the rating (20th percentile) for each combination

In [None]:
df3=pd.pivot_table(df, values='Rating', index='Bucket Size', columns='Content Rating', aggfunc=lambda x:np.quantile(x,0.2))
df3
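The same 20th-percentile table can also be built with `groupby` and `unstack`. This is a sketch on a made-up frame (column values are assumptions):

```python
import pandas as pd

# Made-up records with two size buckets and one content rating
toy = pd.DataFrame(
    {
        "Bucket": ["small", "small", "small", "large", "large", "large"],
        "Content Rating": ["Everyone"] * 6,
        "Rating": [3.0, 4.0, 5.0, 4.0, 4.5, 5.0],
    }
)

# 20th percentile per (bucket, content rating), then move content rating to columns
p20 = (
    toy.groupby(["Bucket", "Content Rating"])["Rating"]
    .quantile(0.2)
    .unstack()
)

print(p20)
```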

### c. Make a heatmap of this

#### i. Annotated



In [None]:
f,ax = plt.subplots()
sns.heatmap(df3, annot=True, linewidths=.5,fmt='.1f')
plt.show()


#### ii. Greens color map

In [None]:
f,ax = plt.subplots()
sns.heatmap(df3, annot=True, linewidths=.5, cmap='Greens',fmt='.1f')
plt.show()

### d. What’s your inference? Are lighter apps preferred in all categories? Heavier? Some?

No, lighter apps are not preferred in all categories. From this heatmap we can conclude that heavier apps, and apps that fall between the lightest and the heaviest, are preferred in all categories.