<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:130%; text-align:left">

<h2 align="left"><font color=#ff6200>Introduction:</font></h2>

Data Analysis and Ratings Prediction for Apps on Google Play Store

<img src="https://www.androidheadlines.com/wp-content/uploads/2012/11/Google-Play-02.webp" width="2400">

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:130%; text-align:left">

<h2 align="left"><font color=#ff6200>About data:</font></h2>


**App :** The name of the app

**Category :** The category of the app

**Rating :** The rating of the app in the Play Store

**Reviews :** The number of reviews of the app

**Size :** The size of the app

**Installs :** The number of installs of the app

**Type :** The type of the app (Free/Paid)

**Price :** The price of the app (0 if it is Free)

**Content Rating :** The appropriate target audience of the app

**Genres:** The genre of the app

**Last Updated :** The date when the app was last updated

**Current Ver :** The current version of the app

**Android Ver :** The minimum Android version required to run the app

<h2 align="left"><font color=#ff6200>Let's get started:</font></h2>

<a id="setup"></a>
# <p style="background-color: #ff6200; font-family:calibri; color:white; font-size:140%; font-family:Verdana; text-align:center; border-radius:15px 50px;">Step 1 | Setup and Initialization</p>

<a id="libraries"></a>
# <b><span style='color:#fcc36d'>Step 1.1 |</span><span style='color:#ff6200'> Importing Necessary Libraries</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
First of all, I will import all the necessary libraries that we will use throughout the project. This generally includes libraries for data manipulation, data visualization, and others based on the specific needs of the project:

In [None]:
# Data
import numpy as np
import pandas as pd
from collections import defaultdict

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msn
from wordcloud import WordCloud

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# Regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

# Classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error


# Hide warnings
import warnings
warnings.filterwarnings('ignore')


<a id="load_dataset"></a>
# <b><span style='color:#fcc36d'>Step 1.2 |</span><span style='color:#ff6200'> Loading the Dataset</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
Next, I will load the dataset into a pandas DataFrame which will facilitate easy manipulation and analysis:

In [None]:
df = pd.read_csv("/kaggle/input/google-play-store-apps/googleplaystore.csv")

<a id="initial_analysis"></a>
# <p style="background-color: #ff6200; font-family:calibri; color:white; font-size:140%; font-family:Verdana; text-align:center; border-radius:15px 50px;">Step 2 | Initial Data Analysis</p>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
Afterward, I am going to gain a thorough understanding of the dataset before proceeding to the data cleaning and transformation stages.

<a id="overview"></a>
# <b><span style='color:#fcc36d'>Step 2.1 |</span><span style='color:#ff6200'> Dataset Overview</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

First I will perform a preliminary analysis to understand the structure and types of data columns:

In [None]:
df.sample(5)

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.shape

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
**As we can see we have data of 10841 applications consisting of 13 attributes.**

In [None]:
df.columns

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
**Here we can see that Rating is the only column stored as float, so the other numerical columns need to be converted to int or float.**

In [None]:
df.info()

<a id="initial_analysis"></a>
# <p style="background-color: #ff6200; font-family:calibri; color:white; font-size:140%; font-family:Verdana; text-align:center; border-radius:15px 50px;">Step 3 | Preprocessing</p>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
**As most of the features are set to data type object and have suffixes, each feature's data type must be converted into a suitable format for analysis.**

### Checking whether all values in Reviews are numeric
    
<a id="monetary"></a>
## <b><span style='color:#fcc36d'>Step 3.1 |</span><span style='color:#ff6200'> Reviews</span></b>

In [None]:
df[~df.Reviews.str.isnumeric()]

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
We could convert the column to integer directly, but the data for this app looks different. Its entries were entered incorrectly: the values appear shifted by one column. We could fix it by setting Category to NaN and shifting all the values, but for now we delete the row.

In [None]:
df=df.drop(df.index[10472])

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
**The feature Reviews must be of integer type.**

In [None]:
df["Reviews"] = df["Reviews"].astype(int)
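
As a hedged alternative to dropping the row by position, the check and the conversion can be sketched on a toy column (the values below are made up for illustration, with '3.0M' standing in for a malformed entry). `pd.to_numeric` with `errors='coerce'` turns anything non-numeric into NaN, so bad rows can be located by value rather than by index:

```python
import pandas as pd

# Toy stand-in for the Reviews column; '3.0M' mimics a malformed entry
reviews = pd.Series(['159', '967', '3.0M', '87510'])

# Non-numeric strings become NaN instead of raising an error
reviews_num = pd.to_numeric(reviews, errors='coerce')

# Boolean mask of the rows that failed to parse
bad_rows = reviews_num.isna()

# Keep only the valid rows and convert them to integer
reviews_clean = reviews_num[~bad_rows].astype(int)
print(reviews_clean.tolist())
```

The mask `bad_rows` can also be used to inspect the offending rows before deciding whether to drop or repair them.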


In [None]:
df.info()

<a id="monetary"></a>
## <b><span style='color:#fcc36d'>Step 3.2 |</span><span style='color:#ff6200'> Size</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
* The Size values carry metric suffixes (k for kilo and M for mega), and some entries are the string 'Varies with device'. We replace k and M with their numeric factors so the values can be converted to numbers.

* The feature Size must be of floating type.
* The suffix, which is a size unit, must be removed. Example: '19.2M' to 19.2

* If size is given as 'Varies with device' we replace it with NaN (it is imputed later).

* The converted floating values of Size are expressed in megabytes.


In [None]:
df['Size'].unique()

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

* Remove the unit characters from Size and convert it to float.



In [None]:
# Parse Size into kilobytes: 'M' means megabytes, 'k' means kilobytes
size = df['Size'].replace('Varies with device', np.nan)
unit = size.str[-1].map({'M': 1000, 'k': 1})  # multiplier to kilobytes
df['Size'] = size.str[:-1].astype('float') * unit
df['Size']

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
* There is a problem! Some application sizes are given in megabytes and some in kilobytes, so the values above are all expressed in kilobytes.

In [None]:
# Convert all sizes from kilobytes to megabytes
df['Size'] = df['Size'] / 1000
df['Size']

In [None]:
df.info()

<a id="monetary"></a>
## <b><span style='color:#fcc36d'>Step 3.3 |</span><span style='color:#ff6200'> Installs and Price</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
* The feature Installs must be of integer type.

* The characters ',' and '+' must be removed. \ Example: '10,000+' to 10000

* The feature Price must be of floating type.

* The prefix '\$' must be removed if Price is non-zero. \ Example: '$4.99' to 4.99
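
The two conversions described above can be sketched on toy values (the strings below are illustrative, not taken from the dataset). Passing `regex=False` makes pandas treat '+', ',' and '$' as literal characters, which matters because '+' and '$' have special meanings in regular expressions:

```python
import pandas as pd

# Toy stand-ins for the Installs and Price columns
installs = pd.Series(['10,000+', '500+', '1,000,000+'])
price = pd.Series(['0', '$4.99', '$0.99'])

# regex=False treats ',', '+' and '$' as plain characters
installs_int = (installs.str.replace(',', '', regex=False)
                        .str.replace('+', '', regex=False)
                        .astype(int))
price_float = price.str.replace('$', '', regex=False).astype(float)

print(installs_int.tolist())
print(price_float.tolist())
```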

In [None]:
df['Installs'].unique()

In [None]:
df['Price'].unique()

In [None]:
# Remove '+', ',' and '$' as literal characters (regex=False), not as regex patterns
items_to_remove=['+',',','$']
cols_to_clean=['Installs','Price']
for item in items_to_remove:
    for col in cols_to_clean:
        df[col]=df[col].str.replace(item,'',regex=False)
df.head()

In [None]:
df.Installs.unique()

In [None]:
df['Price'].unique()

In [None]:
df[df['Price']=='Everyone']

In [None]:
df['Installs']=df['Installs'].astype('int')
df['Price']=df['Price'].astype('float')
df.info()

<a id="monetary"></a>
## <b><span style='color:#fcc36d'>Step 3.4 |</span><span style='color:#ff6200'> Last Updated</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
* Updating the Last Updated column's datatype from string to pandas datetime.

* Extracting new columns Updated_Month and Updated_Year.
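
A minimal sketch of this step on toy dates, assuming the column uses the 'Month day, year' style (the sample strings are illustrative). Giving an explicit `format` avoids relying on format inference:

```python
import pandas as pd

# Toy stand-in for the Last Updated column
last_updated = pd.Series(['January 7, 2018', 'August 1, 2018', 'June 20, 2017'])

# Parse the strings into pandas datetimes
dates = pd.to_datetime(last_updated, format='%B %d, %Y')

# The .dt accessor exposes the date components
updated_month = dates.dt.month
updated_year = dates.dt.year
print(updated_month.tolist(), updated_year.tolist())
```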

In [None]:
#### Change Last update into a datetime column
df['Last Updated'] = pd.to_datetime(df['Last Updated'])
df['Last Updated']

In [None]:
df['Updated_Month']=df['Last Updated'].dt.month
df['Updated_Year']=df['Last Updated'].dt.year

In [None]:
df.drop('Last Updated', axis=1, inplace=True)

In [None]:
df.head()

In [None]:
df.info()

<a id="initial_analysis"></a>
# <p style="background-color: #ff6200; font-family:calibri; color:white; font-size:140%; font-family:Verdana; text-align:center; border-radius:15px 50px;">Step 4 | Data cleaning</p>

In [None]:
null = pd.DataFrame({'Null Values' : df.isna().sum().sort_values(ascending=False), 'Percentage Null Values' : (df.isna().sum().sort_values(ascending=False)) / (df.shape[0]) * (100)})
null

In [None]:
null_counts = df.isna().sum().sort_values(ascending=False)/len(df)
plt.figure(figsize=(16,8))
plt.xticks(np.arange(len(null_counts))+0.5,null_counts.index,rotation='vertical')
plt.ylabel('fraction of rows with missing data')
plt.bar(np.arange(len(null_counts)),null_counts)

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

**It's clear that we have missing values in Rating, Size, Type, Content Rating, Current Ver and Android Ver.**

<a id="monetary"></a>
## <b><span style='color:#fcc36d'>Step 4.1 |</span><span style='color:#ff6200'> Handling missing values</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

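**I fill missing values with the median for the numeric features (Rating and Size) and with the mode for the categorical feature Type. Median imputation is robust to outliers, although it concentrates the imputed values at a single point of the distribution.**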
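
For comparison, here is a small sketch of two imputation strategies on a toy series (the values are made up): median fill, and random-sample fill, where each gap receives a value drawn from the observed entries. The random generator seed is an arbitrary choice for reproducibility:

```python
import numpy as np
import pandas as pd

# Toy rating column with two missing values
ratings = pd.Series([4.1, 4.5, np.nan, 3.9, np.nan, 4.7])

# Median imputation: every gap gets the same central value
median_filled = ratings.fillna(ratings.median())

# Random-sample imputation: each gap gets a value drawn from the observed ones
rng = np.random.default_rng(0)
observed = ratings.dropna().to_numpy()
random_filled = ratings.copy()
missing = random_filled.isna()
random_filled[missing] = rng.choice(observed, size=missing.sum())

print(median_filled.tolist())
```

Median fill is deterministic and simple, while random-sample fill keeps the spread of the observed values at the cost of adding noise.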

In [None]:
def impute_median(series):
    return series.fillna(series.median())

df['Rating'] = df['Rating'].transform(impute_median)

In [None]:
df.info()

In [None]:
def impute_median(series):
    return series.fillna(series.median())

df['Size'] = df['Size'].transform(impute_median)

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df['Type'] = df['Type'].fillna(df['Type'].mode()[0])

In [None]:
df.isnull().sum()

<a id="monetary"></a>
## <b><span style='color:#fcc36d'>Step 4.2 |</span><span style='color:#ff6200'> Delete duplicated data</span></b>

In [None]:
duplicate = df.duplicated()
print(duplicate.sum())

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
duplicate = df.duplicated()
print(duplicate.sum())

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
Extract Numerical and categorical features

In [None]:
num_features=[col for col in df.columns if df[col].dtype!='O']
num_features

In [None]:
cat_features=[col for col in df.columns if df[col].dtype=='O']
cat_features

<a id="monetary"></a>
## <b><span style='color:#fcc36d'>Step 4.3 |</span><span style='color:#ff6200'> Check outliers</span></b>

In [None]:
sns.boxplot(df["Rating"])

In [None]:
sns.boxplot(df["Size"])

In [None]:
sns.boxplot(df["Installs"])

In [None]:
sns.boxplot(df["Price"])
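
The box plots flag outliers visually. The same rule can be sketched numerically with the interquartile range (IQR), which is the convention box plots typically use for their whiskers. The helper name `iqr_outliers` and the toy prices are hypothetical:

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy price column: most apps are free or cheap, one is extreme
price = pd.Series([0.0, 0.0, 0.99, 1.99, 2.99, 4.99, 399.99])
mask = iqr_outliers(price)
print(price[mask].tolist())
```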

<a id="initial_analysis"></a>
# <p style="background-color: #ff6200; font-family:calibri; color:white; font-size:140%; font-family:Verdana; text-align:center; border-radius:15px 50px;">Step 5 | Exploratory Data Analysis (EDA)</p>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

## Category Column

In [None]:
df['Category'].value_counts()

In [None]:
plt.rcParams['figure.figsize'] = (20, 10)
sns.countplot(x='Category',data=df)
plt.xticks(rotation=70)

In [None]:
plt.subplots(figsize=(25,15))
wordcloud = WordCloud(
                          background_color='black',
                          width=1920,
                          height=1080
                         ).generate(" ".join(df.Category))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

## Category vs Rating Analysis

In [None]:
plt.figure(figsize=(20,15))
sns.boxplot(y='Rating',x='Category',data = df.sort_values('Rating',ascending=False))
plt.xticks(rotation=80)

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

## Type Column

In [None]:
df['Type'].value_counts()

In [None]:
plt.rcParams['figure.figsize'] = (8,5)
sns.countplot(x='Type',data=df)
plt.xticks(rotation=70)

In [None]:
df["Type"].value_counts().plot.pie(autopct = "%1.1f%%")

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

## Type vs Rating Analysis

In [None]:
# catplot creates its own figure, so its size is set through height and aspect
sns.catplot(y='Rating',x='Type',data = df.sort_values('Rating',ascending=False),kind='boxen',height=8,aspect=15/8)

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

## Content Rating Column

In [None]:
df['Content Rating'].value_counts()

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

## Content Rating vs Rating Analysis

In [None]:
plt.figure(figsize=(12,8))
sns.boxplot(y='Rating',x='Content Rating',data = df.sort_values('Rating',ascending=False))
plt.xticks(rotation=90)

In [None]:
plt.figure(figsize=(12,8))
sns.barplot(x="Content Rating", y="Installs", hue="Type", data=df)

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

## Genres Column

In [None]:
df['Genres'].value_counts()

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

## Current ver Column

In [None]:
df['Current Ver'].value_counts()

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

## Android Ver Column

In [None]:
df['Android Ver'].value_counts()


In [None]:
# Function to create a scatter plot
def scatters(col1, col2):
    # Create a scatter plot using Seaborn
    plt.figure(figsize=(10, 6))  # Adjust the figure size as needed
    sns.scatterplot(data=df, x=col1, y=col2, hue="Type")
    plt.title(f'Scatter Plot of {col1} vs {col2}')
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.show()

# Function to create a KDE plot
def kde_plot(feature):
    # Create a FacetGrid for KDE plots using Seaborn
    grid = sns.FacetGrid(df, hue="Type", aspect=2)

    # Map KDE plots for the specified feature
    grid.map(sns.kdeplot, feature)

    # Add a legend to distinguish between categories
    grid.add_legend()

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

## kde-Plot Analysis

In [None]:
kde_plot('Rating')

In [None]:
kde_plot('Size')

In [None]:
kde_plot('Updated_Month')

In [None]:
kde_plot('Price')

In [None]:
kde_plot('Updated_Year')

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

## Scatter plot Analysis

In [None]:
scatters('Price', 'Updated_Year')

In [None]:
scatters('Size', 'Rating')

In [None]:
scatters('Size', 'Installs')

In [None]:
scatters('Updated_Month', 'Installs')

In [None]:
scatters('Reviews', 'Rating')

In [None]:
scatters('Rating', 'Price')

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

## Further Analysis

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

### Apps with a 5.0 Rating

In [None]:
df_rating_5 = df[df.Rating == 5.]
print(f'There are {df_rating_5.shape[0]} apps having rating of 5.0')

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

### Installs

In [None]:
sns.histplot(data=df_rating_5, x='Installs', kde=True, bins=50)

plt.title('Distribution of Installs with 5.0 Rating Apps')
plt.show()

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

**Despite the full ratings, the number of installations for the majority of the apps is low. Hence, those apps cannot be considered the best products.**

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

### Reviews

In [None]:
sns.histplot(data=df_rating_5, x='Reviews', kde=True)
plt.title('Distribution of Reviews with 5.0 Rating Apps')
plt.show()

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

**The distribution is right-skewed: most apps with a 5.0 rating have very few reviews, so a perfect rating on its own is misleading.**

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

### Category

In [None]:
df_rating_5_cat =  df_rating_5['Category'].value_counts().reset_index()

In [None]:
# Create a pie chart
plt.figure(figsize=(8, 6))
sns.set(style="whitegrid")
plt.pie(df_rating_5_cat.iloc[:, 1], labels=df_rating_5_cat.iloc[:, 0], autopct='%1.1f%%')
plt.title('Pie chart of App Categories with 5.0 Rating')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

# Show the pie chart
plt.show()

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

**Family, Lifestyle and Medical apps receive the most 5.0 ratings on the Google Play Store, with Family representing about a quarter of the whole.**

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

### Type

In [None]:
df_rating_5_type =  df_rating_5['Type'].value_counts().reset_index()

In [None]:
# Create a pie chart
plt.figure(figsize=(8, 6))
sns.set(style="whitegrid")

# Data for the pie chart
sizes = df_rating_5_type.iloc[:, 1]
labels = df_rating_5_type.iloc[:, 0]

# Pull a slice out by exploding it
explode = (0, 0.1)  # Adjust the second value to control the pull-out distance

# Create the pie chart with default colors
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140, pctdistance=0.85, explode=explode)

# Draw a circle in the center to make it look like a donut chart
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

# Equal aspect ratio ensures that pie is drawn as a circle.
plt.axis('equal')

# Title
plt.title('Pie chart of App Types with 5.0 Rating')

# Show the pie chart
plt.show()


<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

**Almost 90% of the 5.0 rating apps are free on the Google Play Store.**

In [None]:
# Count updates per year and sort by year so the line runs chronologically
freq=df['Updated_Year'].value_counts().sort_index()
freq.plot()
plt.xlabel("Year")
plt.ylabel("Number of updates")
plt.title("Time series plot of Last Updates")

<a id="initial_analysis"></a>
# <p style="background-color: #ff6200; font-family:calibri; color:white; font-size:140%; font-family:Verdana; text-align:center; border-radius:15px 50px;"> Feature Pruning</p>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">

We decide to prune the following features:

* App : App names are of no value for the model
* Genres : The information it stores is the same as the feature Category
* Current Ver : Current Version of an app doesn't hold significant value.
* Android Ver: Android Version of an app doesn't hold significant value.
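
One way to sanity-check the claim about Genres is to count how many distinct categories each genre maps to. The sketch below uses a toy frame with made-up rows. On the real data the same `groupby` could be run on `df`:

```python
import pandas as pd

# Toy frame: each genre appears under exactly one category
toy = pd.DataFrame({
    'Category': ['GAME', 'GAME', 'TOOLS', 'FAMILY'],
    'Genres':   ['Action', 'Arcade', 'Tools', 'Education'],
})

# Number of distinct categories each genre maps to
cats_per_genre = toy.groupby('Genres')['Category'].nunique()

# If this share is close to 1, Genres mostly refines Category
share_single = (cats_per_genre == 1).mean()
print(share_single)
```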

In [None]:
pruned_features = ['App', 'Genres', 'Current Ver', 'Android Ver']

<a id="initial_analysis"></a>
# <p style="background-color: #ff6200; font-family:calibri; color:white; font-size:140%; font-family:Verdana; text-align:center; border-radius:15px 50px;">Step 6 | Data Splitting for Modeling</p>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
### We split the dataset into 80% train and 20% test.
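
As a small illustration of what the split produces, the sketch below applies `train_test_split` to a toy array of 10 rows (the data is made up). `test_size=0.2` holds out 20% of the rows and `random_state` fixes the shuffle:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and target with 10 rows
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Hold out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
```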

In [None]:
target = 'Rating'

In [None]:
X = df.copy().drop(pruned_features+[target], axis=1)
y = df.copy()[target]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

In [None]:
X_train

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
## Label Encoding

In [None]:
le_dict = defaultdict()


In [None]:
features_to_encode = X_train.select_dtypes(include=['category', 'object']).columns

for col in features_to_encode:
    le = LabelEncoder()

    X_train[col] = le.fit_transform(X_train[col]) # Fitting and transforming the Train data
    X_train[col] = X_train[col].astype('category') # Converting the label encoded features from numerical back to categorical dtype in pandas

    X_test[col] = le.transform(X_test[col]) # Only transforming the test data
    X_test[col] = X_test[col].astype('category') # Converting the label encoded features from numerical back to categorical dtype in pandas

    le_dict[col] = le # Saving the label encoder for individual features
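
`LabelEncoder.transform` raises an error when the test data contains a label that was not seen during fitting. A possible alternative, sketched here on toy data, is scikit-learn's `OrdinalEncoder` with `handle_unknown='use_encoded_value'`, which maps unseen labels to a chosen placeholder. The value -1 is an arbitrary choice, and this is a swapped-in technique rather than the approach used above:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy train/test frames where the test set contains an unseen label
train = pd.DataFrame({'Content Rating': ['Everyone', 'Teen', 'Everyone']})
test = pd.DataFrame({'Content Rating': ['Teen', 'Unrated']})

# Unseen categories are mapped to -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_enc = enc.fit_transform(train)
test_enc = enc.transform(test)

print(test_enc.ravel().tolist())
```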

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
## Standardization

In [None]:
# Converting and adding "Updated_Month" to categorical features
# (convert the Index to a list first: Index + list would concatenate strings element-wise)
categorical_features = list(features_to_encode) + ['Updated_Month']
X_train['Updated_Month'] = X_train['Updated_Month'].astype('category')
X_test['Updated_Month'] = X_test['Updated_Month'].astype('category')

# Listing numeric features to scale
numeric_features = X_train.select_dtypes(exclude=['category', 'object']).columns

In [None]:
scaler = StandardScaler()

# Fitting and transforming the Training data
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])
# X_train = scaler.fit_transform(X_train)

# Only transforming the Test data
X_test[numeric_features] = scaler.transform(X_test[numeric_features])
# X_test = scaler.transform(X_test)
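
The reason for fitting the scaler on the training data only can be seen in a toy sketch (the numbers are illustrative). The mean and standard deviation are learned from the training values and then reused for the test values, so no information from the test set leaks into preprocessing:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric feature, split into train and test
train_vals = np.array([[10.0], [20.0], [30.0]])
test_vals = np.array([[40.0]])

scaler_toy = StandardScaler()
train_scaled = scaler_toy.fit_transform(train_vals)  # learns mean and std from train
test_scaled = scaler_toy.transform(test_vals)        # reuses the train statistics

print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled training values have mean 0 and unit standard deviation, while the test value is expressed relative to the training distribution.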

In [None]:
X_train[numeric_features]

In [None]:
X_train.head()

In [None]:
y_train.head()

<a id="initial_analysis"></a>
# <p style="background-color: #ff6200; font-family:calibri; color:white; font-size:140%; font-family:Verdana; text-align:center; border-radius:15px 50px;">Step 7 | Modeling</p>

<a id="monetary"></a>
## <b><span style='color:#fcc36d'>Step 7.1 |</span><span style='color:#ff6200'> Regression</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
### Creating dataframe for metrics

In [None]:
models = ['Linear', 'KNN', 'Random Forest']
datasets = ['train', 'test']
metrics = ['RMSE', 'MAE', 'R2']

multi_index = pd.MultiIndex.from_product([models, datasets, metrics],
                                         names=['model', 'dataset', 'metric'])

df_metrics_reg = pd.DataFrame(index=multi_index,
                          columns=['value'])

In [None]:
df_metrics_reg
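
The three metrics stored in this table can be computed together. The helper below is a hypothetical convenience function (`regression_metrics` is not part of the notebook), shown on toy predictions that are each off by 0.5:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R2 for one set of predictions."""
    return {
        'RMSE': float(np.sqrt(mean_squared_error(y_true, y_pred))),
        'MAE': float(mean_absolute_error(y_true, y_pred)),
        'R2': float(r2_score(y_true, y_pred)),
    }

# Toy example: every prediction misses by 0.5
y_true = [4.0, 3.0, 5.0, 4.0]
y_pred = [4.5, 3.5, 4.5, 3.5]
scores = regression_metrics(y_true, y_pred)
print(scores)
```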

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
### Linear Regressor

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
df_metrics_reg.loc['Linear', 'train', 'R2'] = lr.score(X_train, y_train)
df_metrics_reg.loc['Linear', 'test', 'R2'] = lr.score(X_test, y_test)

In [None]:
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

df_metrics_reg.loc['Linear', 'train', 'MAE'] = mean_absolute_error(y_train, y_train_pred)
df_metrics_reg.loc['Linear', 'test', 'MAE'] = mean_absolute_error(y_test, y_test_pred)

# RMSE as the square root of MSE (the squared=False argument is deprecated in recent scikit-learn)
df_metrics_reg.loc['Linear', 'train', 'RMSE'] = np.sqrt(mean_squared_error(y_train, y_train_pred))
df_metrics_reg.loc['Linear', 'test', 'RMSE'] = np.sqrt(mean_squared_error(y_test, y_test_pred))

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
### KNeighbors Regressor

In [None]:
knn = KNeighborsRegressor()
knn.fit(X_train, y_train)

In [None]:
df_metrics_reg.loc['KNN', 'train', 'R2'] = knn.score(X_train, y_train)
df_metrics_reg.loc['KNN', 'test', 'R2'] = knn.score(X_test, y_test)

In [None]:
y_train_pred = knn.predict(X_train)
y_test_pred = knn.predict(X_test)

df_metrics_reg.loc['KNN', 'train', 'MAE'] = mean_absolute_error(y_train, y_train_pred)
df_metrics_reg.loc['KNN', 'test', 'MAE'] = mean_absolute_error(y_test, y_test_pred)

# RMSE as the square root of MSE (the squared=False argument is deprecated in recent scikit-learn)
df_metrics_reg.loc['KNN', 'train', 'RMSE'] = np.sqrt(mean_squared_error(y_train, y_train_pred))
df_metrics_reg.loc['KNN', 'test', 'RMSE'] = np.sqrt(mean_squared_error(y_test, y_test_pred))

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
### Random Forest Regressor

In [None]:
rf = RandomForestRegressor(max_depth=2, random_state=0)
rf.fit(X_train, y_train)

In [None]:
df_metrics_reg.loc['Random Forest', 'train', 'R2'] = rf.score(X_train, y_train)
df_metrics_reg.loc['Random Forest', 'test', 'R2'] = rf.score(X_test, y_test)

In [None]:
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)

df_metrics_reg.loc['Random Forest', 'train', 'MAE'] = mean_absolute_error(y_train, y_train_pred)
df_metrics_reg.loc['Random Forest', 'test', 'MAE'] = mean_absolute_error(y_test, y_test_pred)

# RMSE as the square root of MSE (the squared=False argument is deprecated in recent scikit-learn)
df_metrics_reg.loc['Random Forest', 'train', 'RMSE'] = np.sqrt(mean_squared_error(y_train, y_train_pred))
df_metrics_reg.loc['Random Forest', 'test', 'RMSE'] = np.sqrt(mean_squared_error(y_test, y_test_pred))

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
### Regression Evaluation

In [None]:
# Converting the values to float (the column was created with object dtype) and rounding them
df_metrics_reg['value'] = df_metrics_reg['value'].astype(float).round(3)
df_metrics_reg

In [None]:
data = df_metrics_reg.reset_index()

g = sns.catplot(col='dataset', data=data, kind='bar', x='model', y='value', hue='metric')

# Adding annotations to bars
# iterate through axes
for ax in g.axes.ravel():
    # add annotations
    for c in ax.containers:
        ax.bar_label(c, label_type='edge')

    ax.margins(y=0.2)

plt.show()

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
* **The Regression predictions don't hold up very well!**

* **This suggests that, with these features, the dataset is not well suited to a regression problem.**
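
One way to put the regression scores in context is to compare them with a trivial baseline that always predicts the training mean. The sketch below uses scikit-learn's `DummyRegressor` on toy ratings (the values are made up). A real model is only useful if its error is clearly below this baseline:

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Toy ratings clustered around 4
y_tr = np.array([4.1, 4.3, 4.5, 3.9, 4.2])
y_te = np.array([4.0, 4.4])
X_tr = np.zeros((len(y_tr), 1))  # features are ignored by the dummy model
X_te = np.zeros((len(y_te), 1))

# The baseline always predicts the mean of the training target
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
baseline_pred = baseline.predict(X_te)
baseline_rmse = float(np.sqrt(np.mean((y_te - baseline_pred) ** 2)))
print(baseline_pred, baseline_rmse)
```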

<a id="monetary"></a>
## <b><span style='color:#fcc36d'>Step 7.2 |</span><span style='color:#ff6200'> Classification</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    

### Let's frame it as a classification problem statement.

### Converting the Rating from continuous to discrete

In [None]:
y_train_int = y_train.astype(int)
y_test_int = y_test.astype(int)
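
Note that `astype(int)` truncates rather than rounds, so the classes are the integer parts of the ratings. A toy comparison (with made-up values) shows the difference from rounding:

```python
import pandas as pd

ratings = pd.Series([1.0, 3.9, 4.0, 4.7, 5.0])

# astype(int) truncates toward zero, so 3.9 -> 3 and 4.7 -> 4
truncated = ratings.astype(int)

# Rounding is an alternative that maps 3.9 -> 4 and 4.7 -> 5
rounded = ratings.round().astype(int)

print(truncated.tolist(), rounded.tolist())
```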

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
### Creating dataframe for metrics

In [None]:
models = ['Logistic Regression', 'KNN', 'Random Forest']
datasets = ['train', 'test']

multi_index = pd.MultiIndex.from_product([models, datasets],
                                         names=['model', 'dataset'])

df_metrics_clf = pd.DataFrame(index=multi_index,
                          columns=['accuracy %'])

In [None]:
df_metrics_clf

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
### Logistic Regression Classifier

In [None]:
lr_clf = LogisticRegression()
lr_clf.fit(X_train, y_train_int)

In [None]:
X_train

In [None]:
logreg_model = LogisticRegression()

# Train the model
logreg_model.fit(X_train, y_train_int)

# Make predictions on the test set
predictions = logreg_model.predict(X_test)

# Alternatively, get predicted probabilities
probabilities = logreg_model.predict_proba(X_test)

# Evaluate the model, e.g., using accuracy
accuracy = logreg_model.score(X_test, y_test_int)

print("Accuracy:", accuracy)

In [None]:
# New data point
new_data_point = [[14, -0.143720, 1.428843, -0.160552, 0, -0.061293, 3, 8, 0.559437]]

# Standardize the new data point using the same scaler used for training
# new_data_point_scaled = scaler.transform(new_data_point)

# Predict the class (the integer part of the rating)
predicted_class = logreg_model.predict(new_data_point)

# Get predicted probabilities for each class
predicted_probabilities = logreg_model.predict_proba(new_data_point)

print("Predicted Class:", predicted_class)
print("Predicted Probabilities:", predicted_probabilities)


In [None]:
# Unscaled data point, in the same column order as X_train
new_data_point_unscaled = [[14, 150, 25.0, 1000, 0, 0.0, 3, 1, 2018]]
# 'scaler' is the StandardScaler instance fitted on the training data above

# Positions of the numeric columns that were scaled (indices start from 0)
columns_to_scale = [1, 2, 3, 5, 8]

# Extract the values to scale from the unscaled data point
columns_to_scale_values = [new_data_point_unscaled[0][i] for i in columns_to_scale]

# StandardScaler expects a 2D array
scaled_values = scaler.transform([columns_to_scale_values])

# Build the scaled data point, leaving the encoded categorical values unchanged
new_data_point = [list(new_data_point_unscaled[0])]
for i, col_index in enumerate(columns_to_scale):
    new_data_point[0][col_index] = scaled_values[0][i]

print("Scaled new_data_point:", new_data_point)
predicted_class = logreg_model.predict(new_data_point)

# Get predicted probabilities for each class
predicted_probabilities = logreg_model.predict_proba(new_data_point)

print("Predicted Class:", predicted_class)
print("Predicted Probabilities:", predicted_probabilities)


In [None]:
df_metrics_clf.loc['Logistic Regression', 'train'] = lr_clf.score(X_train, y_train_int)
df_metrics_clf.loc['Logistic Regression', 'test'] = lr_clf.score(X_test, y_test_int)

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
### KNeighbors Classifier

In [None]:
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train_int)

In [None]:
df_metrics_clf.loc['KNN', 'train'] = knn_clf.score(X_train, y_train_int)
df_metrics_clf.loc['KNN', 'test'] = knn_clf.score(X_test, y_test_int)

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
### Random Forest Classifier

In [None]:
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train_int)

In [None]:
df_metrics_clf.loc['Random Forest', 'train'] = rf_clf.score(X_train, y_train_int)
df_metrics_clf.loc['Random Forest', 'test'] = rf_clf.score(X_test, y_test_int)

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
### Classification Evaluation

In [None]:
# Converting the accuracies to float percentages and rounding them
df_metrics_clf['accuracy %'] = df_metrics_clf['accuracy %'].astype(float).mul(100).round(2)
df_metrics_clf

In [None]:
data = df_metrics_clf.reset_index()

g = sns.catplot(col='dataset', data=data, kind='bar', x='model', y='accuracy %')

# Adding annotations to bars
# iterate through axes
for ax in g.axes.ravel():
    # add annotations
    for c in ax.containers:
        ax.bar_label(c, label_type='edge')

    ax.margins(y=0.2)

plt.show()

<div style="border-radius:10px; padding: 15px; background-color: #ffeacc; font-size:120%; text-align:left">
    
**After comparing with the Regression models, it's clear that we would get better results from Classification!**
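
Accuracy and R² are measured on different scales, so it can help to compare the classification accuracies with a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent class. A toy sketch with made-up labels:

```python
import pandas as pd

# Toy discretized ratings; most apps fall into class 4
y_int = pd.Series([4, 4, 4, 3, 4, 5, 4, 4, 2, 4])

# Accuracy of always predicting the most frequent class
majority_class = y_int.mode()[0]
majority_accuracy = (y_int == majority_class).mean()
print(majority_class, majority_accuracy)
```

On the real data, the same two lines could be applied to `y_test_int` to see how far each classifier sits above this baseline.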

<a id="monetary"></a>
## <b><span style='color:#ff6200'> Conclusion</span></b>

<div style="border-radius:10px; padding: 15px; background-color: #9d8cd1; font-size:130%; text-align:left">

* In conclusion, the dataset from Google Play Store apps has been explored and analyzed using various data visualization techniques with the help of the Matplotlib, Seaborn and WordCloud libraries.

* The preliminary analysis, visualization methods and EDA provided insights into the data and helped in understanding the underlying patterns and relationships among the variables.

* The analysis of the Google Play Store dataset has shown only weak correlations between the rating and other app attributes such as size, reviews, and price. The strongest relationship we found was a moderate positive correlation between the number of installs and the rating, suggesting that higher-rated apps tend to have more installs.

* We also observed that free apps have higher ratings than paid apps, and that app size does not seem to have a significant impact on rating.
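
The correlation statements above can be checked numerically. The sketch below uses a toy frame with made-up values, so its numbers say nothing about the real dataset. On the cleaned `df`, the same call could be applied to the numeric columns. Spearman correlation is chosen here as an assumption because it is rank-based and less sensitive to the heavy-tailed Installs values:

```python
import pandas as pd

# Toy numeric columns standing in for the cleaned dataset
toy = pd.DataFrame({
    'Rating':   [4.1, 4.5, 3.8, 4.7, 4.0],
    'Installs': [1000, 50000, 500, 100000, 5000],
    'Price':    [0.0, 0.0, 4.99, 0.0, 0.99],
})

# Rank-based correlation of each attribute with Rating
corr_with_rating = toy.corr(method='spearman')['Rating'].drop('Rating')
print(corr_with_rating)
```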