# Hello everyone! I'm going to be focusing on Data Engineering, Visualization, and Prediction within the Android Permission Dataset. 

**Data Engineering:** 
1. How can we deal with the categorical variables in the features?
2. What can we do to get the most out of the descriptions and names of the apps?
3. Do we need all of these features or can we get rid of some of them to get a faster model?

**Data Visualizations:**
1. What are some patterns within the features that may give us a better idea of how to organize the data?
2. Are the classes balanced?
3. How well do our features correlate with other features?

**Data Prediction:**
1. Which model has the highest accuracy on the test data?
2. Is there anything specific (false positive or false negative) that the model is having a hard time identifying?

Thank you for reading through this notebook and leave a like if you enjoyed it!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/android-permission-dataset/Android_Permission.csv')
df.head()

In [None]:
df.shape

# Useful Columns

A lot of these columns may not be necessary, especially the ones after the first 8. Let's check the values in more of these columns to see how necessary they actually are.

To figure this out, I'm going to drop the columns which have fewer than 1000 "1's", because a column with so few 1's has nearly the same value for almost every app and so tells the model very little.
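
The pruning rule described above can also be written without a loop. The sketch below uses a small made-up frame (the column names and the `min_ones` threshold are illustrative assumptions, not values from the dataset) and drops every 0/1 column whose count of 1's falls below the threshold.

```python
import pandas as pd

# Toy stand-in for the permission columns; the names are hypothetical
toy = pd.DataFrame({
    "perm_a": [1, 1, 1, 0, 1],
    "perm_b": [0, 0, 0, 0, 0],   # never requested
    "perm_c": [0, 1, 0, 0, 0],   # rarely requested
})

min_ones = 2  # illustrative threshold; the notebook uses 1000 on the real data

# Summing a 0/1 column counts its 1's, and works even when a column has no 1's at all
ones_per_column = toy.sum()
sparse_cols = ones_per_column[ones_per_column < min_ones].index.tolist()

pruned = toy.drop(columns=sparse_cols)
print(sparse_cols)
print(pruned.columns.tolist())
```

Counting with a sum sidesteps the case where a column contains no 1's, which is where a lookup such as `value_counts()[1]` would fail.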


In [None]:
dropper = []
for col in df.columns[10:]:
    counts = df[col].value_counts()
    # .get avoids a KeyError when a column contains no 0's or no 1's
    if counts.get(0, 0) == 29999 or counts.get(1, 0) < 1000:
        dropper.append(col)

df = df.drop(columns=dropper)
df = df.drop(columns='Related apps')

In [None]:
df.head()

In [None]:
df.isnull().sum()

# Missing Values

There are very few missing values, so I'm just going to drop all the rows with missing values.

In [None]:
df = df.dropna()

In [None]:
df[df.columns[0:4]].head()

# Text Columns

Excluding the Category column, the rest of these columns are text columns, which means we need to find some way to turn them into numerical columns. To do so, I will take common patterns in malicious apps and try to create columns that capture those patterns.


# Things to take into account

1. Number of uppercase letters in all 3 columns
2. Number of periods in Package
3. Spelling errors in Description
4. Number of words in Description
5. Words like Free or Better in App
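
As a rough illustration of items 1, 2, and 5 from the list above, here is a minimal sketch on a few invented app names and package names. The rows and the `App_Keyword` column name are assumptions made up for this example; only the `App` and `Package` column names come from the dataset.

```python
import pandas as pd

# Hypothetical app names and packages, used only to illustrate the features
toy = pd.DataFrame({
    "App": ["FREE Flashlight", "Better Battery", "Notes"],
    "Package": ["com.example.flash", "org.better.battery.app", "notes"],
})

# Item 1: number of uppercase letters in the app name
toy["App_Upper"] = toy["App"].apply(lambda s: sum(c.isupper() for c in str(s)))

# Item 2: number of periods in the package name
toy["Pack_Periods"] = toy["Package"].astype(str).str.count(r"\.")

# Item 5: flag names containing "free" or "better", ignoring case
toy["App_Keyword"] = (
    toy["App"].str.contains(r"free|better", case=False, regex=True).astype(int)
)

print(toy[["App_Upper", "Pack_Periods", "App_Keyword"]])
```

Matching with `case=False` matters here because app names often capitalize these words.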


In [None]:
df['App_Upper'] = df['App'].apply(lambda message: sum(1 for c in str(message) if c.isupper()))
df['Pack_Upper'] = df['Package'].apply(lambda message: sum(1 for c in str(message) if c.isupper()))
df['Description_Upper'] = df['Description'].apply(lambda message: sum(1 for c in str(message) if c.isupper()))
df['Pack_Periods'] = df['Package'].apply(lambda message: str(message).count('.'))
df['Desc_len'] = df['Description'].apply(lambda message: len(str(message).split()))

In [None]:
#from spellchecker import SpellChecker
#import re

#spell = SpellChecker()
#def misspelled(word):
#    new_word = re.sub(r'<.+?>', '', str(word))
#    miss = spell.unknown(re.findall(r"[\w']+|[.,!?;]", new_word))
#    return len(miss)

#df['Desc_Misspelled'] = np.vectorize(misspelled)(df['Description'])


In [None]:
# case=False so that "Free" and "FREE" are matched as well as "free"
df['App_Free'] = df['App'].str.contains('free', case=False).astype(int)

# Data Visualizations

Okay, I think I'm about done with data engineering, so let's do some data visualizations now.


In [None]:
df = df.drop(['App', 'Package', 'Description'], axis = 1)

In [None]:
sns.countplot(data = df, x = 'Class')

The first thing to note is that the classes are unbalanced, with a lot more scam apps than normal ones. This may cause our model to flag more apps as scams even if they aren't. Stratifying the data when we split it does not remove the imbalance, but it keeps the class proportions the same in the training and test sets, so the evaluation stays representative.
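
To make the effect of stratification concrete, here is a small sketch on made-up labels (the 80/20 class ratio and the variable names are assumptions for illustration only). It shows that a stratified split keeps the same share of each class on both sides of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 80 samples of class 1 and 20 samples of class 0
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# With stratify=y_toy, both splits keep the original share of class 1
print(y_tr.mean(), y_te.mean())
```

If the imbalance itself needs handling, some scikit-learn estimators such as `RandomForestClassifier` accept `class_weight='balanced'`, which is a separate option from stratified splitting.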

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df_cat = df.copy()
df['Category'] = le.fit_transform(df['Category'])
plt.figure(dpi = 130)
sns.heatmap(df.corr())

Looking at the correlation heatmap, our target 'Class' does not seem to be highly correlated with any other variable. In fact, the only features that really correlate with any other feature are "Dangerous permissions count" and "Safe permissions count".

Perhaps the correlation chart is not showing the full picture, so let's look at relations between variables with other plots.
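
A heatmap with many columns can be hard to read, so one option is to pull out just the correlations with the target and rank them. The sketch below does this on a tiny invented frame (`feat_a` and `feat_b` are hypothetical feature names); the same expression should work on any all-numeric DataFrame that has a `Class` column.

```python
import pandas as pd

# Invented data: feat_a tracks Class closely, feat_b only weakly
toy = pd.DataFrame({
    "Class":  [1, 1, 1, 0, 0, 0],
    "feat_a": [5, 6, 7, 1, 2, 3],
    "feat_b": [1, 0, 1, 0, 1, 0],
})

# Pearson correlation of every feature with the target, strongest first
corr_with_class = (
    toy.corr()["Class"]
    .drop("Class")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_class)
```

Keep in mind that Pearson correlation only measures linear relationships, so a low value here does not rule out a feature being useful to a tree-based model.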

In [None]:
plt.figure(figsize = (9, 7), dpi = 100)
sns.scatterplot(data = df, x = 'Rating', y = 'Price')

Interestingly enough, there seems to be very little correlation between the rating and the price of an app. One would expect that as the rating goes up, the price goes down, but it seems that people care much more about other things, such as how well the app works, than about the price.

In [None]:
plt.figure(figsize = (11, 7), dpi = 100);
sns.barplot(data = df_cat, x = 'Category', y = 'Price', estimator=np.mean, ci = None)
plt.xticks(rotation = 90);

So it seems that the "Medical" category has apps that cost significantly more on average than those in any other category. Second (but not even that close) is "Travel and Local", which is kind of ironic because it means you have to pay just to download these apps before you even decide where to travel to.

I do want to explore the "Medical" category a bit more though and see why it costs so much.

In [None]:
df_cat[(df_cat['Category'] == 'Medical') & (df_cat['Price'] > 10)][['Category', 'Price', 'Class']]

Oh lol, it seems that all the apps in the "Medical" category with prices > $10 are just scams. When you think about it, that kind of makes sense, considering medical apps shouldn't cost a lot of money before you even download them.

Speaking of prices, let's see whether a higher price makes it more likely that an app is a scam.

In [None]:
plt.figure(figsize=(10, 7), dpi = 150)
sns.boxplot(data = df, x = 'Class', y = 'Price')

So it seems that most of the apps, whether a scam or not, tend to have no price. However, there are many outliers, suggesting that scam apps do carry higher prices to try and trick people into buying them.

In fact, it seems like none of the real apps have a price at all. Let's confirm this.

In [None]:
df[(df['Price'] > 0) & (df['Class'] == 0)]

Wow, so it seems that in this dataset there are no apps that have a price > 0 and are not scams.

This is interesting because, in terms of the model, it is both good and bad. I can create a new column called Price > 0, which will help the model learn that any price greater than 0 means the app is a scam. At the same time though, if this model were to take in new data where, say, an app that is not a scam has a price over $0, the model would most likely classify it as a scam. I am going to implement the column in this case because we are not taking in new data and I want to achieve the highest accuracy possible without overfitting.

In terms of real life, this suggests that most Android apps you find on the app store will be free to download, and if one does have a price, it is most likely a scam.
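
The relationship between price and class discussed above can be summarized in a single table with `pd.crosstab`. This is a sketch on invented rows (the prices and labels are made up, and `Paid` is a hypothetical name for the flag); it follows the notebook's convention that `Class == 0` means the app is not a scam.

```python
import pandas as pd

# Made-up rows; 1 = scam, 0 = not a scam, mirroring the Class column
toy = pd.DataFrame({
    "Price": [0.0, 0.0, 0.0, 1.99, 4.99, 0.0],
    "Class": [0, 1, 0, 1, 1, 1],
})

# 1 if the app costs anything, 0 if it is free
paid = (toy["Price"] > 0).astype(int).rename("Paid")

# Rows: free (0) vs paid (1); columns: Class
table = pd.crosstab(paid, toy["Class"])
print(table)
```

A zero in the "paid and not a scam" cell is the pattern described above. On real data it is worth checking how many paid apps there are in total before trusting a rule like this.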

In [None]:
df['Price > 0'] = (df['Price'] > 0).astype(int)

I don't exactly understand what the "Dangerous permissions count" and "Safe permissions count" columns are, so let's take a closer look at how they differ between classes.

In [None]:
plt.figure(figsize=(9, 7), dpi = 100)
sns.boxplot(data = df, x = 'Class', y = 'Dangerous permissions count')

In [None]:
df[df['Dangerous permissions count'] > 10]['Class'].value_counts()

Um, okay, well the "Dangerous permissions count" column seems to be almost completely useless according to the boxplot, considering both classes have about the same median, upper quartile, lower quartile, and even outliers in the same place. However, when looking at the apps with "Dangerous permissions count" > 10, there do seem to be a lot more scam apps in there (then again, there are more scam apps than real apps overall as well).
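
Because scam apps outnumber real ones overall, raw counts in a subset can be misleading. One way to account for that is to compare the scam rate inside the subset with the scam rate of the whole dataset. The sketch below does this on invented numbers (the rows are made up for illustration; only the column names come from the dataset).

```python
import pandas as pd

# Invented rows; Class 1 = scam, 0 = not a scam
toy = pd.DataFrame({
    "Dangerous permissions count": [2, 12, 15, 3, 11, 1, 14, 0],
    "Class":                       [1, 1,  1,  0, 0,  1, 1,  0],
})

# Share of scams across all apps (the base rate)
base_rate = toy["Class"].mean()

# Share of scams among apps with more than 10 dangerous permissions
high_perm = toy["Dangerous permissions count"] > 10
subset_rate = toy.loc[high_perm, "Class"].mean()

print(base_rate, subset_rate)
```

If the subset rate is clearly above the base rate, the threshold carries some signal; if the two are about equal, the extra scam apps in the subset just reflect the overall imbalance.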

In [None]:
plt.figure(figsize=(9, 7), dpi = 100)
sns.boxplot(data = df, x = 'Class', y = 'Safe permissions count')

In [None]:
df[df['Safe permissions count'] > 5]['Class'].value_counts()

Scam apps also seem to have a higher "Safe permissions count" than real apps, which doesn't make much sense considering you wouldn't expect scams to have safe permissions. Again, it's a very vague column that I can't really understand the point of.

Okay, I've decided I'm still going to keep the permissions columns because they might end up being useful. I'm also now going to move on to machine learning.

In [None]:
df.head()

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = df.drop('Class', axis = 1)
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify= y, test_size = 0.2, random_state = 101)

In [None]:
ct = ColumnTransformer([('st', StandardScaler(), ['Rating', 'Number of ratings', 'Price', 'Dangerous permissions count', 'Safe permissions count'])], remainder='passthrough')
ct.fit(X_train)
X_train_sc = ct.transform(X_train)
X_test_sc = ct.transform(X_test)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

In [None]:
rc = RidgeClassifier(alpha = 0, class_weight=None, solver = 'cholesky')
rc.fit(X_train_sc, y_train)

In [None]:
y_pred_rc = rc.predict(X_test_sc)

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

In [None]:
print(classification_report(y_test, y_pred_rc))

In [None]:
ConfusionMatrixDisplay.from_estimator(rc, X_test_sc, y_test)

In [None]:
ls = LinearSVC(dual = False, C = 0.6)
ls.fit(X_train_sc, y_train)

In [None]:
y_pred_ls = ls.predict(X_test_sc)

In [None]:
print(classification_report(y_test, y_pred_ls))

In [None]:
ConfusionMatrixDisplay.from_estimator(ls, X_test_sc, y_test)

In [None]:
# Named rf so the fitted RidgeClassifier stored in rc is not overwritten
rf = RandomForestClassifier(n_estimators=200)
rf.fit(X_train, y_train)

In [None]:
y_pred_rf = rf.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred_rf))

In [None]:
ConfusionMatrixDisplay.from_estimator(rf, X_test, y_test)

In [None]:
gc = GradientBoostingClassifier(max_depth = 4, n_estimators = 200)
gc.fit(X_train, y_train)
y_pred_gc = gc.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred_gc))

In [None]:
ConfusionMatrixDisplay.from_estimator(gc, X_test, y_test)

So all in all it seems that the Gradient Boosting algorithm did best with an accuracy of 82% at identifying malicious apps. Thank you for going through this notebook and have a great day!
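
To come back to the question of whether a model struggles more with false positives or false negatives, the confusion matrix can be unpacked into its four cells. This is a minimal sketch on made-up labels and predictions (`y_true_toy` and `y_hat_toy` are invented for illustration, not results from the models above).

```python
from sklearn.metrics import confusion_matrix

# Invented labels and predictions; 1 = scam, 0 = not a scam
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 1]
y_hat_toy  = [1, 1, 0, 1, 0, 1, 0, 1]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_hat_toy).ravel()

# Share of real apps wrongly flagged as scams
false_positive_rate = fp / (fp + tn)
# Share of scams the model missed
false_negative_rate = fn / (fn + tp)

print(tn, fp, fn, tp)
print(false_positive_rate, false_negative_rate)
```

Running the same few lines with `y_test` and each model's predictions would show which kind of error dominates for that model.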