# Welcome to Pear Inc. 

Hi there! 
My name is Robert! You can call me Bob 😉. <br>
I'm the communications officer (fancy title ha!) in our glorious company 😇.
My job is to help facilitate product development and market penetration 🤓. <br>
I spend endless hours talking to engineers, product managers, and customers 😱  

Since we are a 20-person start-up (all of us have fancy names 😂), I also do some recruiting from time to time 💪 <br>
We are looking for **brave souls who are not afraid of a challenge and will help us** with our new product line of smart t-shirts! 🧐 <br> 

(Our CEO believes that smart t-shirts are the right direction for some reason 😅 I guess if you make something nobody needs, you won't have to sell it 🤓) <br>

Let me tell you a little bit more about our problem that you can help us with:<br>
We are creating a life-changing smart t-shirt that has Bluetooth and connects to your phone 🥳. The outfits will be customizable through downloaded applications. Our smart t-shirt will be developed with Google Wear OS, which is a version of Google's Android operating system designed for smartwatches and other wearables. So users will be able to install custom programs through the Google Play Store 🤭. <br> And we will sell them for $999.90 apiece 💰💰💰<br>
But our engineers wanted to ensure that only Pear Inc. approved programs can be installed on our t-shirts, because market analysis showed that potential customers are afraid of ransomware that will break their "*premium*" t-shirts 🤦. So we need an antivirus for approving apps on the fly! <br>However, we don't want to install an off-the-shelf antivirus on our t-shirts 🤫, because BIG profit margins matter 🏦!

##### Enough chit-chat!
Let's get down to the business of why I contacted you: <br>
Our bright engineers came up with an algorithm that creates compressed signatures for the apps in the Google Play Store. It is called '*manifold averaging generally intelligent compressor*' or as we call it 'MAGIC'. <br>
The engineers told us that the outputs of MAGIC reflect the statistical properties of the uncompressed apps (whatever that may mean! 🤦). <br> MAGIC takes a Google Play Store app as input and outputs a 4-dimensional numerical signature (they called it a vector, but calling it a vector is not fancy enough for marketing! 🤪).

Now, since these signatures are just numbers, an off-the-shelf antivirus can't work with them (and even if it could, we can't install an off-the-shelf antivirus on our t-shirts, since it needs too much computing power and space). Therefore **we need a lightweight proof of concept that takes these signatures as inputs and outputs labels (virus or not) for them.** We eventually want to install your program on our smart t-shirts, where it will scan a Google Play Store app (its signature, to be precise!) and stop the app's execution if it thinks the app is a virus! But we are not going that far just yet, so you only need to create the pipeline that takes the signatures and outputs labels for them. Don't worry about the rest; it is just a proof of concept after all 😉. We are providing the dataset for you to develop your model.

In a nutshell: 
- There are 4-dimensional (4-feature) numerical inputs (signatures) with labels!
- We need a simple model that takes these inputs and labels them (Virus, Not a Virus)
- We also need you to evaluate your model. Choose any metric you want, but don't forget to explain why, since I don't know much about this field (that is why we need your help!)

Things to keep in mind:
- There are fewer 'Virus' samples in the dataset than 'Not a Virus' samples. (Naturally!)
- While we call it MAGIC, it still sometimes doesn't work well 🤦, so there are signatures with missing features (missing values).
- I don't know much about these things so please show your work, your thinking process and please make it as clear as possible, otherwise I get confused 😵. (Visualizations of the data and comments in your code would be great!)
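
Because 'Virus' samples are the minority, plain accuracy can look great for a model that never flags anything. Here is a minimal sketch with made-up labels (the 90/10 split and both "models" are assumptions for illustration, not numbers from the real dataset) showing why recall, precision, or F1 on the 'Virus' class is usually more telling:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up ground truth: 90 'Not a Virus' (0) followed by 10 'Virus' (1).
y_true = [0] * 90 + [1] * 10

# A useless "model" that never flags a virus.
y_never = [0] * 100

# A hypothetical model: 5 false alarms, 8 of the 10 viruses caught.
y_model = [0] * 85 + [1] * 5 + [1] * 8 + [0] * 2

acc_never = accuracy_score(y_true, y_never)
rec_never = recall_score(y_true, y_never, zero_division=0)

acc_model = accuracy_score(y_true, y_model)
rec_model = recall_score(y_true, y_model)
prec_model = precision_score(y_true, y_model)
f1_model = f1_score(y_true, y_model)

print(f"never-flag model: accuracy={acc_never:.2f}, virus recall={rec_never:.2f}")
print(f"hypothetical model: accuracy={acc_model:.2f}, virus recall={rec_model:.2f}, "
      f"precision={prec_model:.2f}, F1={f1_model:.2f}")
```

The never-flag model reaches 90% accuracy while catching no viruses at all, which is exactly the failure an antivirus cannot afford.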

***
##### Let me describe the dataset, and you are ready to get to work!

It is a CSV file. Each row represents a signature for an app. The first 4 columns from left to right represent dimensions (features), and the last column is the label (isVirus: True or False). 
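
As a rough illustration of that layout, the sketch below parses a tiny made-up CSV and splits it into features and label. The values are invented, and the column names `feature_1` to `feature_4` are assumed from the names used in the solution code:

```python
import io

import pandas as pd

# Tiny made-up CSV in the described layout (values are invented).
csv_text = """feature_1,feature_2,feature_3,feature_4,isVirus
0.1,1.2,,3.4,False
0.5,,2.2,1.1,True
0.9,0.3,1.8,2.0,False
"""

df_demo = pd.read_csv(io.StringIO(csv_text))

X_demo = df_demo.iloc[:, :4]  # first 4 columns: the signature
y_demo = df_demo.iloc[:, -1]  # last column: the label

print(X_demo.isna().sum())    # missing values per feature
print(y_demo.value_counts())  # class balance
```

Empty cells are read as `NaN`, so missing features show up directly in `isna()`.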

- Visualize the data (so that people like me can understand!)
- Clean up the data (balance it out, impute missing values and so on... depending on the method you are going to use!)
- Visualize the cleaned data (so that people like me can understand the effect of the cleaning process!)
- Create a simple model that performs reasonably well. (If it doesn't perform well, comment on why and how to improve it!)
- Evaluate the model with a test set you will create from the dataset. (Pretty plots make things easier to understand)
- Upload your code to a private GitHub repo you can share with us, and invite us (https://github.com/tarikkranda, https://github.com/ltc0060 and https://github.com/ahmetkoklu) as collaborators so only we can see our super-secret project. 

And you are done! (Don't forget to comment, and show your work please 🤓)


### SOLUTION :


In [88]:
import math
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy.stats import norm
import lightgbm as lgb
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import mean_squared_log_error, r2_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

In [66]:
df= pd.read_csv("../input/dataset/dataset.csv")

In [67]:
df.info()

In [68]:
df.describe()

In [69]:
# Calculating proportions of missing data

def Missing_Values(data):
    """Return per-column missing-value counts and rates, highest rate first."""
    variable_name = []
    missing_value_rate = []
    total_missing_value = []
    for col in data.columns:
        n_missing = data[col].isnull().sum()
        variable_name.append(col)
        total_missing_value.append(n_missing)
        # Round after scaling to percent to avoid floating-point noise in the output
        missing_value_rate.append(round(100 * n_missing / data[col].shape[0], 2))
    missing_data = pd.DataFrame({"Feature": variable_name, "Missing_Value_Rate (%)": missing_value_rate, "Total Missing Value": total_missing_value})
    # Keep the four columns with the highest missing rate
    return missing_data.sort_values("Missing_Value_Rate (%)", ascending=False)[0:4]

Missing_Values(df)
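
For comparison, the same missing-value rates can be obtained with a short pandas chain. This is a sketch on a made-up toy frame (`toy` and its values are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Made-up toy frame with a few missing entries.
toy = pd.DataFrame({
    "feature_1": [1.0, np.nan, 3.0, np.nan],
    "feature_2": [1.0, 2.0, np.nan, 4.0],
    "isVirus": [True, False, False, True],
})

# isna() marks missing cells; the column mean of that boolean mask is the missing fraction.
missing_rate = toy.isna().mean().mul(100).round(2).sort_values(ascending=False)
print(missing_rate)
```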

In [70]:
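y = df.isVirus
# Pass the series as x explicitly; seaborn expects plot variables as keyword arguments
ax = sns.countplot(x=y)
# sort_index() puts False before True, so the unpacking does not depend on which class is larger
F, T = y.value_counts().sort_index()
print('Number of False: ',F)
print('Number of True : ',T)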

In [71]:
grouped = df.groupby(by='isVirus').count()
grouped

In [72]:
# True-False values were converted to 0-1 values by encoding the Label
le = preprocessing.LabelEncoder()
df["isVirus"] = le.fit_transform(df["isVirus"].values)

In [73]:
df

In [74]:
hyper_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'l1',
    'learning_rate': 0.005,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.7,
    'bagging_freq': 10,
    'verbose': 0,
    "max_depth": 8,
    "num_leaves": 128,  
    "max_bin": 512,
    "num_iterations": 200
}

In [75]:
# Dictionaries are used because a separate train/test split and model is created for each feature

gbms = dict()
colmns = ["feature_1","feature_2","feature_3","feature_4"]
ys = dict()
X_trains = dict()
df_tests = dict()
y_trains = dict()
y_tests = dict()

# For each column, rows where the value is missing become the "test" set to be imputed,
# and rows where the value is present are used for training and validation.
# Note: isVirus stays among the predictors here, so this imputation needs labelled rows.

for col in colmns:
    has_value = df[col].notna()
    ys['y'+ str(col)] = df.loc[has_value, col]
    # drop(columns=...) returns a new frame, which avoids SettingWithCopyWarning
    df_train = df.loc[has_value].drop(columns=col)
    df_tests['df_test'+ str(col)] = df.loc[~has_value].drop(columns=col)
    X_trains['X_train'+ str(col)], X_test, y_trains['y_train'+ str(col)], y_tests['y_test'+ str(col)] = train_test_split(df_train, ys['y'+ str(col)], test_size=0.25, random_state=42)
    gbms['gbm'+ str(col)]= lgb.LGBMRegressor(**hyper_params)
    # Early stopping is passed as a callback; newer LightGBM releases no longer
    # accept early_stopping_rounds as a fit() argument
    gbms['gbm'+ str(col)].fit(X_trains['X_train'+ str(col)], y_trains['y_train'+ str(col)],
        eval_set=[(X_test, y_tests['y_test'+ str(col)])],
        eval_metric='l1',
        callbacks=[lgb.early_stopping(stopping_rounds=20)])

In [41]:
ys

In [42]:
X_trains

In [43]:
gbms

In [76]:
gbms["gbmfeature_1"]

In [77]:
# Each feature column is treated as a regression target and predicted by its own model.
# The R2 score is computed on the training split, so it is an optimistic estimate.

colmns = ["feature_1","feature_2","feature_3","feature_4"]
y_preds = dict()
for col in colmns:
    y_preds['y_pred'+ str(col)] = gbms['gbm'+ str(col)].predict(X_trains['X_train'+ str(col)], num_iteration=gbms['gbm'+ str(col)].best_iteration_)
    # Report R2 itself; taking its square root would no longer be an R2 score and fails for negative values
    print('The R2 Score of prediction of',col + str(":"), round(r2_score(y_trains['y_train'+ str(col)], y_preds['y_pred'+ str(col)]), 5))

In [78]:
colmns = ["feature_1","feature_2","feature_3","feature_4"]
y_preds_test = dict()
for col in colmns:
    y_preds_test['y_pred_test'+ str(col)] = gbms['gbm'+ str(col)].predict(df_tests['df_test'+ str(col)], num_iteration=gbms['gbm'+ str(col)].best_iteration_)

In [79]:
y_preds_test

In [81]:
y_preds_test["y_pred_testfeature_1"]

In [82]:
df_tests

In [52]:
df_predicted_nan = df_tests["df_testfeature_1"]
df_predicted_nan

In [83]:
df_predicted_nan["feature_1"] = y_preds_test["y_pred_testfeature_1"]
df_predicted_nan

In [84]:
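colmns = ["feature_1","feature_2","feature_3","feature_4"]
dfs_predicted_nan = dict()
for col in colmns:
    # Work on a copy so the frames stored in df_tests are not modified
    dfs_predicted_nan['df_predicted_nan'+ str(col)] = df_tests['df_test'+ str(col)].copy()
    # Fill the previously missing column with the model's predictions
    dfs_predicted_nan['df_predicted_nan'+ str(col)][col] = y_preds_test['y_pred_test'+ str(col)]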

In [85]:
dfs_predicted_nan

In [86]:
dfs_predicted_nan["df_predicted_nanfeature_4"]
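
The frames above hold only the rows that had a missing value. One possible way to write the estimates back into a single complete frame is sketched below on made-up data; `toy`, `fake_preds`, and `toy_imputed` are hypothetical names, and the prediction values are stand-ins rather than model output:

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two features.
toy = pd.DataFrame({
    "feature_1": [1.0, np.nan, 3.0, np.nan],
    "feature_2": [0.5, 0.7, np.nan, 0.9],
})

# Stand-ins for model predictions, indexed by the rows that were missing.
fake_preds = {
    "feature_1": pd.Series([2.0, 4.0], index=[1, 3]),
    "feature_2": pd.Series([0.8], index=[2]),
}

# Copy first so the original frame keeps its gaps for before/after comparisons.
toy_imputed = toy.copy()
for col, preds in fake_preds.items():
    # .loc aligns on the row index, so only the missing cells are overwritten.
    toy_imputed.loc[preds.index, col] = preds

print(toy_imputed)
```

In the notebook, the row index would come from the corresponding `df_tests` frame and the values from `y_preds_test`, assuming both keep the same row order.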

In [94]:
# "Before" shows the distribution of the observed values of each feature;
# "After" shows the values estimated for the rows where that feature was missing.
for col in colmns:
    sns.displot(data=df, x=col, kind="kde").set(title='Before estimate: observed ' + col)
    sns.displot(data=dfs_predicted_nan["df_predicted_nan" + col], x=col, kind="kde").set(title='After estimate: imputed ' + col)


In [95]:
dfs_predicted_nan
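
The brief also asks for a simple classifier and an evaluation on a held-out test set. The following is only a sketch on synthetic data, not the method used in this notebook: it swaps in median imputation and a class-weighted logistic regression, and every name and number in it (`X_syn`, `y_syn`, the 15% virus rate, the 5% missing rate) is an assumption for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-in for the signatures: roughly 15% viruses, shifted in feature space.
y_syn = (rng.random(n) < 0.15).astype(int)
X_syn = rng.normal(size=(n, 4)) + y_syn[:, None] * 1.5
# Knock out about 5% of the cells to mimic missing features.
X_syn[rng.random((n, 4)) < 0.05] = np.nan

# Stratify so train and test keep the same class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=42
)

# The imputer and scaler are fitted on the training split only, inside the pipeline.
clf = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

print(classification_report(y_te, y_hat, target_names=["Not a Virus", "Virus"]))
virus_recall = recall_score(y_te, y_hat)
```

`class_weight="balanced"` is one lightweight way to handle the class imbalance without resampling, and recall on the 'Virus' class is reported because a missed virus is the costly error here.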