# **Project Name**    - Play Store App Review Analysis



##### **Project Type**    - EDA/Regression/Classification/Unsupervised
##### **Contribution**    - Individual
##### **Team Member  -** Sakshi  Pande


# **Project Summary -**

In this notebook, I analyze Google Play Store data using Python. This is my first data analysis project.

**Google Play Store apps and reviews**

Mobile apps are everywhere. They are easy to create and can be lucrative. Because of these two factors, more and more apps are being developed. In this notebook, we will do a comprehensive analysis of the Android app market by comparing over ten thousand apps in Google Play across different categories. We'll look for insights in the data to devise strategies to drive growth and retention.

Let's take a look at the data, which consists of two files:

* **playstore data.csv:** contains all the details of the applications on Google Play. There are 13 features that describe a given app.
* **user_reviews.csv:** contains 100 reviews for each app, most helpful first. The text in each review has been pre-processed and attributed with three new features: Sentiment (Positive, Negative or Neutral), Sentiment Polarity and Sentiment Subjectivity.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


1. What are the top categories on the Play Store?
2. Are the majority of apps paid or free?
3. How important is an app's rating?
4. Which audience (content rating) should an app target?
5. Which category has the most installs?
6. How does the number of apps vary by genre?
7. Does the last update date affect the rating?
8. What is the percentage breakdown of review sentiments?
9. How are app updates distributed over the years?


#### **Define Your Business Objective?**

The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the android apps. Explore and analyse the data to discover key factors responsible for app engagement and success.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the questions in the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

# **1. Know Your Data**

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
playstore_path = 'Play Store Data.csv'
reviews_path = 'User Reviews.csv'


### Dataset First View

In [None]:
# Dataset First Look
playstore_df = pd.read_csv(playstore_path)
user_df = pd.read_csv(reviews_path)


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f"Playstore Dataframe : Rows count -  {playstore_df.shape[0]}, Column count - {playstore_df.shape[1]}")


In [None]:
print(f"User review Dataframe : Rows count -  {user_df.shape[0]}, Column count - {user_df.shape[1]}")


1. Playstore dataframe has 10841 rows and 13 columns.
2. User_reviews dataframe has 64295 rows and 5 columns.



### Dataset Information

In [None]:
# Dataset Describe
playstore_df.describe()

In [None]:
# Dataset Info
def playstore_info():
    temp = pd.DataFrame(index = playstore_df.columns)
    temp['datatype'] = playstore_df.dtypes
    temp['not null value'] = playstore_df.count()
    temp['null value'] = playstore_df.isnull().sum()
    temp['% of null value'] = playstore_df.isnull().mean() * 100  # mean() gives a fraction, so scale to percent
    temp['unique count'] = playstore_df.nunique()
    return temp
playstore_info()

#### What did you know about your dataset?

The null values in the Play Store data are:

* Rating has 1474 null values (13.60% of the rows).
* Type has 1 null value (0.01% of the rows).
* Content Rating has 1 null value (0.01% of the rows).
* Current Ver has 8 null values (0.07% of the rows).
* Android Ver has 3 null values (0.03% of the rows).
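The two info helpers in this section differ only in the dataframe they read. As a minimal sketch, a single helper could take the dataframe as an argument. The name `null_summary` and the toy data are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, null count, null percentage and unique count."""
    return pd.DataFrame({
        "datatype": df.dtypes,
        "null value": df.isnull().sum(),
        # isnull().mean() is a fraction, so multiply by 100 for a percentage
        "% of null value": (df.isnull().mean() * 100).round(2),
        "unique count": df.nunique(),
    })


# Toy frame standing in for playstore_df (values are made up)
toy_df = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, None, 3.9, None],
})
summary = null_summary(toy_df)
print(summary)
```

The same call would then work for both `playstore_df` and `user_df`.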

In [None]:
user_df.describe()

In [None]:
# Dataset Info
def user_info():
    temp = pd.DataFrame(index = user_df.columns)
    temp['datatype'] = user_df.dtypes
    temp['not null value'] = user_df.count()
    temp['null value'] = user_df.isnull().sum()
    temp['% of null value'] = user_df.isnull().mean() * 100  # mean() gives a fraction, so scale to percent
    temp['unique count'] = user_df.nunique()
    return temp

user_info()

The null values in the user reviews data are:

* Translated_Review has 26868 null values (41.79% of the rows).
* Sentiment has 26863 null values (41.78% of the rows).
* Sentiment_Polarity has 26863 null values (41.78% of the rows).
* Sentiment_Subjectivity has 26863 null values (41.78% of the rows).

# **2. Understanding Your Variables**

### Playstore_Df

In [None]:
# Dataset Columns
playstore_df.columns

#### Variables Description

The 13 columns are identified as below:
1. **App** - The name of the application, with an optional short description.
2. **Category** - The category the app belongs to.
3. **Rating** - The average rating the app received from its users.
4. **Reviews** - The total number of users who have reviewed the application.
5. **Size** - The amount of storage the application occupies on the mobile phone.
6. **Installs** - The total number of installs/downloads of the application.
7. **Type** - Whether the app is free to use or paid.
8. **Price** - The price payable to install the app. For free apps, the price is zero.
9. **Content Rating** - The age group for which the app is suitable.
10. **Genres** - The other categories to which an application can belong.
11. **Last Updated** - The date when the application was last updated.
12. **Current Ver** - The current version of the application.
13. **Android Ver** - The Android version that can support the application on its platform.


### User_df

In [None]:
user_df.columns

#### Variable Description

 The 5 columns are identified as follows:

* **App:** Contains the name of the app with a short description (optional).
* **Translated_Review:** It contains the English translation of the review dropped by the user of the app.
* **Sentiment:** It gives the attitude/emotion of the writer. It can be ‘Positive’, ‘Negative’, or ‘Neutral’.
* **Sentiment_Polarity:** It gives the polarity of the review. Its range is [-1,1], where 1 means ‘Positive statement’ and -1 means a ‘Negative statement’.
* **Sentiment_Subjectivity:** It measures how much of the review is personal opinion rather than factual information. Its range is [0,1]. Higher subjectivity means the review is more opinion-based, and lower subjectivity indicates the review is more factual.

# **3. Data Wrangling**

### Playstore_df

#### 1. Handling missing values

##### Android Ver column

In [None]:
# The rows containing NaN values in the Android Ver column
playstore_df[playstore_df['Android Ver'].isnull()]

In [None]:
# Finding the different values the 'Android Ver' column takes
playstore_df['Android Ver'].value_counts()

The NaN values in the Android Ver column cannot be replaced by any particular value. Since only 3 rows contain NaN values in this column, which is less than 0.03% of the rows in the dataset, they can be dropped.

In [None]:
#drop the null values
playstore_df = playstore_df[playstore_df['Android Ver'].notna()]

In [None]:
playstore_df['Android Ver'].isnull().sum()

##### Current Ver column

In [None]:
playstore_df['Current Ver'].isnull().sum()

Only 8 rows contain NaN values in the Current Ver column, which is around 0.07% of the rows in the dataset, and there is no particular value with which we can replace them. These rows can be dropped.

In [None]:
#drop the null values
playstore_df = playstore_df[playstore_df['Current Ver'].notna()]

##### Type column

In [None]:
playstore_df[playstore_df['Type'].isnull()]

The Type column takes only two values, Free and Paid. For a paid app, the Price column shows its price; otherwise it shows '0'. The row with the missing Type has a Price of '0', which means the app is free. Hence we can replace this NaN value with Free.

In [None]:
playstore_df.loc[9148, "Type"] = 'Free'
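The assignment above relies on a hard-coded row label. A minimal sketch of the same rule without the label is shown below: it fills a missing Type with Free only where the Price is zero. The toy dataframe is an assumption for illustration, and Price is treated as a string because it has not been converted yet at this stage.

```python
import pandas as pd

# Toy frame standing in for playstore_df (values are made up)
toy_df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Type": ["Free", None, "Paid"],
    "Price": ["0", "0", "$4.99"],  # Price is still a string here
})

# Strip the dollar sign so the price can be compared with zero
price_as_float = toy_df["Price"].str.replace("$", "", regex=False).astype(float)

# Fill Type only where it is missing AND the app costs nothing
missing_free = toy_df["Type"].isna() & (price_as_float == 0)
toy_df.loc[missing_free, "Type"] = "Free"
print(toy_df)
```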

##### Rating column

In [None]:
playstore_df['Rating'].isnull().sum()

* The Rating column contains 1470 NaN values, which is approximately 13.5% of the rows in the dataset. It is not practical to drop these rows because we would lose a large amount of data, which may affect the quality of the analysis.
* The NaN values in this case can be imputed with an aggregate (mean or median) of the remaining values in the Rating column.

In [None]:
mean_rating = round(playstore_df['Rating'].mean(),4)
median_rating = round(playstore_df['Rating'].median(),4)

[mean_rating, median_rating]

Plot a histogram and a box plot to decide whether to use the mean or the median.


In [None]:
fig, ax = plt.subplots(2,1, figsize=(12,7))
sns.histplot(playstore_df['Rating'], kde=True, color='firebrick', ax=ax[0])  # distplot is deprecated in recent seaborn versions
sns.boxplot(x='Rating',data=playstore_df, ax=ax[1])

* Use the mean when the data is numeric, roughly normally distributed, and not skewed.
* Use the median when the data is numeric and skewed, or if there are outliers. The median is unaffected by outliers, unlike the mean.
* A box plot is a primary tool in EDA that can help determine skewness. The median is represented by a line inside the box, and the 25th and 75th percentiles are represented by the two edges of the box. If the median is not in the middle of the box, the distribution is skewed.
* Here the data is left skewed, so we use the median to fill the missing values in the Rating column.
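The visual check can be backed by a number. As a rough sketch, the sample skewness from `Series.skew()` is negative for left-skewed data, and in that case the median usually sits above the mean. The ratings below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up ratings with one low outlier and one missing value
ratings = pd.Series([1.0, 3.5, 4.0, 4.2, 4.3, 4.4, 4.5, 4.5, 4.6, 4.7, None])

skewness = ratings.skew()        # negative value -> left skewed
mean_value = ratings.mean()      # pulled down by the low outlier
median_value = ratings.median()  # robust to the outlier

# Impute the missing value with the median
filled = ratings.fillna(median_value)
print(skewness, mean_value, median_value)
```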

In [None]:
playstore_df['Rating'] = playstore_df['Rating'].fillna(median_rating)  # assign back instead of inplace on a column

In [None]:
playstore_df['Rating'].isnull().sum()

#### 2. Handling duplicate values

In [None]:
playstore_df.duplicated().sum()

In [None]:
playstore_df['App'].value_counts()

The ROBLOX app has duplicate entries. We keep only the first entry for each app name.

In [None]:
playstore_df.drop_duplicates(subset = 'App', inplace = True)

In [None]:
playstore_df[playstore_df['App'] == 'ROBLOX']

#### 3. Changing Datatypes

In [None]:
playstore_df.dtypes

##### Last updated column

In [None]:
playstore_df['Last Updated'] = pd.to_datetime(playstore_df['Last Updated'])
playstore_df['Last Updated'].head()

##### Price column

In [None]:
playstore_df['Price'].value_counts()

To convert this column from string to float, we must first drop the $ symbol from all the values. Then we can assign the float datatype to those values.
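As an alternative sketch, the same conversion can be done on the whole column at once with pandas string methods. The prices below are made-up examples of the two formats described above.

```python
import pandas as pd

# Made-up prices in the two formats found in the Price column
prices = pd.Series(["0", "$4.99", "$0.99", "$399.99"])

# Remove the literal '$' (regex=False treats it as plain text), then parse
prices_float = pd.to_numeric(prices.str.replace("$", "", regex=False))
print(prices_float)
```

Passing `errors="coerce"` to `pd.to_numeric` would turn any unparseable entry into NaN instead of raising an error.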

In [None]:
# drop_dollar function: drops the $ symbol if present and returns a float
def drop_dollar(val):
    if '$' in val:
        return float(val[1:])
    else:
        return float(val)

In [None]:
#drop_dollar function applied to price column
playstore_df['Price'] = playstore_df['Price'].apply(lambda x : drop_dollar(x))


##### Installs column

In [None]:
playstore_df['Installs'].value_counts()

To convert all the values in the Installs column from string datatype to integer datatype, we must first drop the '+' symbol and the ',' separators from all the entries if present and then we can change its datatype.

In [None]:
def convert_plus(val):
  '''
  This function drops the + symbol and the , separators if present
  and returns the value with int datatype.
  '''
  # Remove '+' and ',' wherever they appear, then cast to int
  return int(val.replace('+', '').replace(',', ''))

In [None]:
playstore_df['Installs'] = playstore_df['Installs'].apply(lambda x : convert_plus(x))

In [None]:
playstore_df['Installs'].head()

##### Size column

In [None]:
playstore_df['Size'].value_counts()

We can see that the values in the Size column use different units. 'M' stands for MB and 'k' stands for KB. To easily analyse this column, it is necessary to convert all the values to a single unit. In this case, we will convert all the units to MB.

Since 1MB = 1024KB, we convert KB to MB by dividing the values that are in KB by 1024.
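As a quick worked example of this rule, 512k becomes 512 / 1024 = 0.5 MB. The sketch below applies the rule to a single string; the helper name `size_to_mb` is hypothetical, and it returns NaN for entries without a unit, such as 'Varies with device'.

```python
import math


def size_to_mb(size: str) -> float:
    """Convert a size string such as '19M' or '512k' to megabytes."""
    text = size.strip()
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):
        # 1 MB = 1024 KB
        return round(float(text[:-1]) / 1024, 4)
    # No recognised unit (for example 'Varies with device')
    return math.nan


print(size_to_mb("19M"), size_to_mb("512k"), size_to_mb("Varies with device"))
```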

In [None]:
# Defining a function to convert all the entries in KB to MB and then converting them to float datatype.

def convert_kb_to_mb(val):
  '''
  This function converts all the valid entries in KB to MB and returns the result in float datatype.
  '''
  try:
    if 'M' in val:
      return float(val[:-1])
    elif 'k' in val:
      return round(float(val[:-1])/1024, 4)
    else:
      return val
  except:
    return val

In [None]:
playstore_df['Size'] = playstore_df['Size'].apply(lambda x : convert_kb_to_mb(x))

In [None]:
playstore_df['Size'].value_counts()

In [None]:
playstore_df['Size'] = playstore_df['Size'].apply(lambda x : str(x).replace('Varies with device','NaN') if 'Varies with device' in str(x) else x)

In [None]:
playstore_df['Size'] = playstore_df['Size'].apply(lambda x: float(x))

In [None]:
# checking for null values because we converted the varies with device to nan
playstore_df['Size'].isnull().sum()

A substantial share of the entries in this column had the value Varies with device. Replacing them with a measure of central tendency (mean or median) may give misleading visualizations and results. Hence these values are left as NaN.

##### Reviews column

In [None]:
playstore_df['Reviews'].value_counts()

In [None]:
playstore_df['Reviews'] = playstore_df['Reviews'].astype(int)

In [None]:
playstore_df.describe()

### User_df

In [None]:
user_df.isnull().sum()

There are a lot of NaN values. We need to analyse these values and see how we can handle them.

In the majority of cases, the rows that do not have a review (NaN value instead) also have NaN values in the columns Sentiment, Sentiment_Polarity, and Sentiment_Subjectivity, so these rows carry no usable information and can be dropped.
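This observation can be checked directly instead of by eye. The sketch below measures, among rows with no review text, the share whose three sentiment columns are all missing. The toy dataframe is made up for illustration; on the real data the share may be below 1.0.

```python
import pandas as pd

# Toy frame standing in for user_df (values are made up)
toy_reviews = pd.DataFrame({
    "Translated_Review": ["Great app", None, "Too many ads", None],
    "Sentiment": ["Positive", None, "Negative", None],
    "Sentiment_Polarity": [0.8, None, -0.4, None],
    "Sentiment_Subjectivity": [0.75, None, 0.5, None],
})

sentiment_cols = ["Sentiment", "Sentiment_Polarity", "Sentiment_Subjectivity"]
no_review = toy_reviews["Translated_Review"].isna()

# Among rows without a review, fraction where every sentiment column is NaN
share_all_missing = (
    toy_reviews.loc[no_review, sentiment_cols].isna().all(axis=1).mean()
)

cleaned = toy_reviews.dropna()
print(share_all_missing, len(cleaned))
```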

In [None]:
user_df = user_df.dropna()

In [None]:
user_df.isnull().sum()

In [None]:
# Inspecting the sentiment column
user_df['Sentiment'].value_counts()

# **4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables**

# Data Visualization on play store data:

We have successfully cleaned the data. Now we can perform some data visualization and come up with insights on the given datasets.


## Chart 1. Correlation Heatmap

In [None]:
playstore_df_numeric = playstore_df.select_dtypes(include=[np.number])

In [None]:
playstore_df_numeric.corr()

In [None]:
# heat map for playstore
plt.figure(figsize=(20,10))
sns.heatmap(playstore_df_numeric.corr(), annot = True)
plt.title('Correlation Heatmap for Playstore Data', size=20)

**Findings/Insights**

* There is a strong positive correlation between the Reviews and Installs columns. This is expected: the more installs an app has, the larger its user base and the more reviews it collects.
* Price is slightly negatively correlated with Rating, Reviews, and Installs. As the price of an app increases, the average rating, total number of reviews, and installs tend to fall slightly.
* Rating is slightly positively correlated with Installs and Reviews. Apps with a higher average user rating tend to have somewhat more installs and reviews.
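To read one coefficient without scanning the whole heatmap, `Series.corr` gives the Pearson correlation for a single pair of columns. The sketch below uses made-up numbers in which reviews grow with installs, so it only illustrates the call and not the dataset's actual value.

```python
import pandas as pd

# Made-up values: reviews grow together with installs
toy_df = pd.DataFrame({
    "Installs": [1_000, 10_000, 100_000, 1_000_000],
    "Reviews": [10, 200, 3_000, 40_000],
})

# Pearson correlation between the two columns
reviews_installs_corr = toy_df["Reviews"].corr(toy_df["Installs"])
print(round(reviews_installs_corr, 3))
```

A coefficient describes how two columns move together; on its own it does not show that one causes the other.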

**Why did I choose this chart?**

I chose this chart because a heatmap shows the correlation between multiple numerical variables at a glance through color intensity.

**Will the gained insights help creating a positive business impact?**

Recognizing that higher installs lead to more reviews suggests that encouraging more users to install the app can boost the overall feedback and visibility. This can be achieved through marketing efforts or referral programs.


> Let us check whether there is any correlation between the columns of the two dataframes.




In [None]:
merged_df = pd.merge(playstore_df, user_df, on='App', how = "inner")
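An inner merge keeps only the apps that appear in both dataframes and repeats each app's details once per review. The sketch below shows this on toy data; the `validate="one_to_many"` argument is an optional safeguard added here, not part of the original cell, and it raises an error if an app name is duplicated on the left side.

```python
import pandas as pd

# Toy frames standing in for playstore_df and user_df (values are made up)
apps = pd.DataFrame({"App": ["A", "B", "C"], "Rating": [4.5, 3.8, 4.1]})
reviews = pd.DataFrame({
    "App": ["A", "A", "B", "Z"],
    "Sentiment": ["Positive", "Negative", "Positive", "Neutral"],
})

# 'C' has no reviews and 'Z' has no app details, so both are dropped
merged = pd.merge(apps, reviews, on="App", how="inner", validate="one_to_many")
print(merged)
```

Because numeric app-level columns such as Rating are repeated for every review, correlations computed on the merged dataframe weight apps by their number of reviews.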

In [None]:
def merged_dfinfo():
  temp = pd.DataFrame(index=merged_df.columns)
  temp['data_type'] = merged_df.dtypes
  temp["count of non null values"] = merged_df.count()
  temp['NaN values'] = merged_df.isnull().sum()
  temp['% NaN values'] = merged_df.isnull().mean() * 100  # mean() gives a fraction, so scale to percent
  temp['unique_count'] = merged_df.nunique()
  return temp
merged_dfinfo()

**Correlation Heatmap of merged dataframe**

In [None]:
merged_df_numeric = merged_df.select_dtypes(include=[np.number]).corr()

In [None]:
# Heat Map for the merged data frame
plt.figure(figsize = (15,10))
sns.heatmap(merged_df_numeric, annot= True, cmap='Greens')
plt.title(' Heatmap for merged Dataframe', size=20)

In [None]:
merged_df = merged_df.dropna(subset=['Sentiment', 'Translated_Review'])

## Chart 2. What is the ratio of number of Paid apps and Free apps?

In [None]:
data = playstore_df['Type'].value_counts()
# Take the labels from the counts so they always match the slice order
labels = data.index

#create a pie chart
plt.figure(figsize=(5,5))
colors = ["#00EE76","#7B8895"]
explode=(0.01,0.1)
plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%',explode=explode,textprops={'fontsize': 15})
plt.title('Distribution of Paid and Free apps',size=15,loc='center')
plt.legend()


**Findings/Insights**

From the above graph we can see that about 92% of the apps in the Google Play Store are free and about 8% are paid.

**Why did I choose this chart?**

A pie chart is an effective way to show the percentage share of each group.

**Will the gained insights help creating a positive business impact?**

With 92.2% of apps being free, businesses should focus on freemium models and in-app purchases to attract and retain users.



## Chart 3. Which Content Rating groups are most common on the Play Store?

In [None]:
# Content rating of the apps
data = playstore_df['Content Rating'].value_counts()
# Take the labels from the counts so each slice gets its own rating name
labels = data.index

#create pie chart
plt.figure(figsize=(10,10))
explode=(0,0.1,0.1,0.1,0.0,1.3)
colors = ['C4', 'r', 'c', 'g', 'm', 'k']
plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%',explode=explode,textprops={'fontsize': 15})
plt.title('Content Rating',size=20,loc='center')
plt.legend()

**Findings/Insights**

The Everyone content rating has the highest share on the Play Store at 81.80%, followed by Teen.


**Why did I choose this chart**

A pie chart is useful for showing the percentage distribution.

**Will the gained insights help creating a positive business impact?**

## Chart 4. Top categories on Google Playstore?

In [None]:
playstore_df.groupby('Category')['App'].count().sort_values(ascending = False)

In [None]:
# Number of apps per category, sorted from most to least common
category_counts = playstore_df['Category'].value_counts()

x_list = category_counts.tolist()        # app counts
y_list = category_counts.index.tolist()  # category names

In [None]:
#Number of apps belonging to each category in the playstore
plt.figure(figsize=(20,10))
graph = sns.barplot(y = x_list, x = y_list, palette= "tab10")
graph.set_title("Top categories on Playstore", fontsize = 25)
# Categories are on the x-axis and app counts on the y-axis
graph.set_xlabel('App Categories', size=15)
graph.set_ylabel('Number of Apps', size=15)
graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right',);

**Findings/Insights**

There are 33 categories in total in the dataset. From the above output we can conclude that most of the apps on the Play Store fall under the FAMILY and GAME categories, and the fewest fall under the EVENTS and BEAUTY categories.

**Why did I choose this chart?**

A bar chart is effective for comparing a numerical value across categories.

**Will the gained insights help creating a positive business impact?**

Understanding the distribution of apps across categories allows businesses to allocate resources effectively, focusing on areas with the greatest potential for growth.

 Percentage of apps belonging to each category in the playstore




In [None]:
plt.figure(figsize= (20,20))
plt.pie(playstore_df.Category.value_counts(), labels =playstore_df.Category.value_counts().index, autopct='%1.2f%%' )
my_circle = plt.Circle( (0,0), 0.50, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.title('Percentage of apps share in each Category', fontsize = 25)
plt.show()

## Chart 5. Which category of apps has the most installs?

In [None]:
# total app installs in each category of the play store

a = playstore_df.groupby('Category')['Installs'].sum().sort_values(ascending = True)
a.plot.barh(figsize=(15,10), color = 'c', )
# In a horizontal bar chart the categories are on the y-axis
plt.xlabel('Total app Installs', fontsize = 15)
plt.ylabel('App Categories', fontsize = 15)
plt.title('Total app installs in each category', fontsize = 20)

**Findings/ Insights**

This tells us which categories of apps have the most installs. The Game, Communication and Tools categories have the highest number of installs compared to other categories of apps.

**Why did I choose this chart**

This is again a comparison of a numerical value (installs) against a categorical one (category), so a bar chart fits.

**Will the gained insights help creating a positive business impact?**

Understanding the app category popularity allows businesses to tailor their marketing strategies to focus on the most installed categories, increasing the likelihood of user engagement and retention.



## Chart 6. Average rating of the apps

In [None]:
playstore_df['Rating'].value_counts().plot.bar(figsize = (20,8), color = 'm')
plt.xlabel('Average rating',fontsize = 15 )
plt.ylabel('Number of apps', fontsize = 15)
plt.title('Average rating of apps in Playstore', fontsize = 20)
plt.legend()

* We can represent the ratings in a better way if we group the ratings between certain intervals. Here, we can group the rating as follows:

1. 4-5: Top rated
2. 3-4: Above average
3. 2-3: Average
4. 1-2: Below average
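The same grouping can be sketched with `pd.cut`, which assigns each rating to an interval in one call. With `right=False` each bin includes its lower edge, so a rating of exactly 4 counts as Top rated, in line with the intervals listed above. The ratings are made up for illustration.

```python
import pandas as pd

# Made-up ratings, including values that sit exactly on a bin edge
ratings = pd.Series([1.5, 2.0, 3.9, 4.0, 5.0])

rating_groups = pd.cut(
    ratings,
    # Bins are [-inf, 2), [2, 3), [3, 4), [4, inf)
    bins=[float("-inf"), 2, 3, 4, float("inf")],
    labels=["Below average", "Average", "Above average", "Top rated"],
    right=False,
)
print(rating_groups.tolist())
```

Unlike an if/else function, `pd.cut` leaves missing ratings as NaN rather than placing them in a group.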

In [None]:
# Defining a function rating_app to group the ratings as mentioned above
def rating_app(val):
    if val >=4:
        return 'Top rated'
    elif val>=3 and val <4:
        return 'Above average'
    elif val>=2 and val<3:
        return 'Average'
    else:
        return 'Below average'

In [None]:
playstore_df['Rating_Groups'] = playstore_df['Rating'].apply(lambda x: rating_app(x))

In [None]:
# Average app ratings
playstore_df['Rating_Groups'].value_counts().plot.bar(figsize=(15,5), color = 'royalblue')
plt.xlabel('Rating Group', fontsize = 12)
plt.ylabel('Number of apps', fontsize = 12)
plt.title('Average app ratings', fontsize = 18)
plt.xticks(rotation=0)
plt.legend()

## Chart 7. What are the Top 10 installed apps in any category?

In [None]:
def top_10_in_category(category):
    # Category names in the dataset are upper case, e.g. 'GAME'
    category = category.upper()
    top_10 = playstore_df[playstore_df['Category'] == category]
    top_10_apps = top_10.sort_values(by='Installs', ascending=False).head(10)
    plt.figure(figsize=(15,6), dpi=100)
    plt.title('Top 10 Installed Apps',size = 20)
    graph = sns.barplot(x = top_10_apps.App, y = top_10_apps.Installs, palette= "icefire")
    graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right')

In [None]:
top_10_in_category('game')

In [None]:
top_10_in_category('family')

**Findings**

From the above graph we can see that in the Game category, Subway Surfers, Candy Crush Saga, and Temple Run 2 have the highest installs. In the same way, by passing different category names to the function, we can get the top 10 installed apps in any category.

## Chart 8. Top apps that are of free type

In [None]:
free_df = playstore_df[playstore_df['Type'] == 'Free']

In [None]:
top_10_free_app = free_df[free_df['Installs'] == free_df['Installs'].max()]

In [None]:
top_10_free_app['App']

In [None]:
top_10_free_app['Category'].value_counts().plot.bar(figsize=(20,6), color= ('darkcyan','blueviolet'))
plt.xlabel('Category', size=15)
plt.ylabel('Number of apps', size=15)
plt.title('Categories in which the most installed free apps belong', size=19)
plt.xticks(rotation=45)
plt.legend()


## Chart 9. Top apps that are of paid type

In [None]:
paid_df = playstore_df[playstore_df['Type'] == 'Paid'].copy()  # copy so new columns can be added safely

In [None]:
paid_df.groupby('Price')['App'].count().sort_values(ascending= False).plot.bar(figsize = (30,6), color = 'crimson')

**Findings**

* Paid apps charge users a certain amount to download and install the app. This amount varies from one app to another.
* Many apps charge a small amount, whereas some charge a larger amount. In this dataset the price to download an app varies from USD 0.99 to USD 400.
* To select the top paid apps, it would not be fair to look only at the number of installs, because apps that charge a lower installation fee will generally be installed by more people.
* A better way to determine the top apps in the paid category is to find the revenue each app generated through installs.
* This is given by:

 Revenue generated through installs = (Number of installs)x(Price to install the app)
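As a worked example of this formula, an app priced at USD 6.99 with 100,000 installs would generate about 6.99 x 100,000 = USD 699,000. The sketch below applies it to a toy dataframe with made-up values. Since the Installs values in the dataset are thresholds such as '10,000+', the result is best read as a rough lower-bound estimate.

```python
import pandas as pd

# Toy frame standing in for paid_df (values are made up)
paid = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Price": [6.99, 0.99, 2.49],
    "Installs": [100_000, 1_000_000, 10_000],
})

# Revenue = number of installs x price to install the app
paid["Revenue"] = paid["Price"] * paid["Installs"]

# Rank by revenue rather than by installs alone
top_paid = paid.nlargest(2, "Revenue")
print(top_paid[["App", "Revenue"]])
```

Here the cheap app B still ranks first because its install count outweighs its low price, while A ranks above C on both price and installs.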


**Create new column Revenue in the paid_df to visualize**

In [None]:
paid_df['Revenue'] = paid_df['Price']* paid_df['Installs']
paid_df.head()

In [None]:
paid_df[paid_df['Revenue'] == paid_df['Revenue'].max()]

In [None]:
# Top 10 paid apps in the play store
top10paid_apps=paid_df.nlargest(10, 'Revenue', keep='first')
top10paid_apps['App']

In [None]:
# Categories in which the top 10 paid apps belong to
top10paid_apps['Category'].value_counts().plot.bar(figsize=(15,5), color= ["orange", "red", "green", "blue", "purple"])
plt.xlabel('Category',size=12)
plt.ylabel('Number of apps',size=12)
plt.title('Categories in which the top 10 paid apps belong', size=15)
plt.xticks(rotation=0)
plt.legend()


In [None]:
# Top paid apps according to the revenue generated through installs alone
top10paid_apps.groupby('App')['Revenue'].mean().sort_values(ascending= True).plot.barh(figsize=(16,10), color='darkorange')
plt.xlabel('Revenue Generated (USD)', size=15)
plt.title('Top apps based on revenue generated through installation fee', size=20)
plt.legend()

## Chart 10. Does the last update date have an effect on rating?

In [None]:
print(playstore_df['Last Updated'].head())
#fetch update year from date
playstore_df["Update year"] = playstore_df["Last Updated"].apply(lambda x: x.strftime('%Y')).astype('int64')

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
sns.lineplot(x="Update year", y="Rating", data=playstore_df)
plt.title("Update Year VS Rating")
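The line plot shows the mean rating per update year. As a minimal sketch, the same numbers can be tabulated together with the number of apps behind each mean, which helps judge how reliable each year's average is. The dates and ratings below are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for playstore_df (values are made up)
toy_df = pd.DataFrame({
    "Last Updated": pd.to_datetime(
        ["2016-05-01", "2017-03-12", "2018-07-30", "2018-08-01"]
    ),
    "Rating": [3.9, 4.0, 4.4, 4.2],
})

# Mean rating and number of apps for each update year
by_year = (
    toy_df.groupby(toy_df["Last Updated"].dt.year)["Rating"]
    .agg(["mean", "count"])
)
print(by_year)
```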

# Data Visualization on User Reviews:

## Chart 1. Percentage of Review Sentiments

In [None]:
import matplotlib
counts = list(user_df['Sentiment'].value_counts())
labels = 'Positive Reviews', 'Negative Reviews','Neutral Reviews'
matplotlib.rcParams['font.size'] = 20
matplotlib.rcParams['figure.figsize'] = (10, 15)
plt.pie(counts, labels=labels, explode=[0.01, 0.05, 0.05], shadow=True, autopct="%.2f%%")
plt.title('Percentage of Review Sentiments', fontsize=20)
plt.axis('off')
plt.legend(bbox_to_anchor=(0.9, 0, 0.5, 1))
plt.show()

**Findings**


1. Positive reviews are **64.30%**
2. Negative reviews are **22.80%**
3. Neutral reviews are **12.90%**
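These percentages can also be computed directly, without reading them off the pie chart. The sketch below uses `value_counts(normalize=True)` on a made-up list of sentiments, so the numbers differ from the findings above.

```python
import pandas as pd

# Made-up sentiments standing in for user_df['Sentiment']
sentiments = pd.Series(
    ["Positive", "Positive", "Positive", "Negative", "Neutral"]
)

# normalize=True returns fractions, so scale to percentages
shares = (sentiments.value_counts(normalize=True) * 100).round(2)
print(shares)
```

Using `shares.index` as the pie labels keeps each label attached to its own slice, whatever the order of the counts.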

## Chart 2.  Apps with the highest number of positive reviews

In [None]:
positive_user_df = user_df[user_df['Sentiment'] == 'Positive']

In [None]:
positive_user_df.groupby('App')['Sentiment'].value_counts().nlargest(10).plot.barh(figsize=(10,8),color='seagreen').invert_yaxis()
plt.title("Top 10 positive review apps")
plt.xlabel('Total number of positive reviews')
plt.legend()

**Findings**

Helix Jump is the app with the highest number of positive reviews.

## Chart 3. Apps with the highest number of negative reviews.

In [None]:
negative_user_df = user_df[user_df['Sentiment'] == 'Negative']

In [None]:
negative_user_df.groupby('App')['Sentiment'].value_counts().nlargest(10).plot.barh(figsize=(10,8),color='red').invert_yaxis()
plt.title("Top 10 negative review apps")
plt.xlabel('Total number of negative reviews')
plt.legend()

**Findings**

Angry Birds Classic has the highest number of negative reviews.

# **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
1. Develop apps in the least crowded categories, such as Events and Beauty, as they are not explored much.
2. Most of the apps are free, so focusing on free apps is more important.
3. Focus on content suitable for Everyone, as it reaches the widest audience and increases the chances of getting the highest installs.
4. Update apps regularly to attract more users.
5. Keep in mind that user sentiment keeps changing as people use the app, so focus on users' needs and the features they ask for.

#  **6. Conclusion**

In this project of analyzing Play Store applications, we worked on several parameters that can help businesses do well when launching their apps on the Play Store.

In the initial phase, we focused on the problem statements and data cleaning, to ensure that the analysis gives reliable results.

Next, we looked for patterns in the data using graphs and data visualization libraries. For each question, we explained why we used a specific chart, what insights we got, and how they affect the business.

Finally, we summarised our findings and suggested what needs to be done to achieve the business objective.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***