# Task 2 - Regression on the tabular data. General Machine Learning

You have a dataset (train.csv) that contains 53 anonymized features and a target
column. 

Your task is to build a model that predicts the target from the provided
features. Please provide predictions for the hidden_test.csv file.

The target metric is RMSE.

The main goal is to provide a GitHub repository that contains:

- Jupyter notebook with exploratory data analysis;
- train.py Python script for model training;
- predict.py Python script for model inference on test data;
- file with prediction results;
- README file that contains instructions for project setup and general guidance
  on the project;
- requirements.txt file.

Please provide documented code. The scripts (train.py and predict.py) must be
executable from the terminal.
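
For reference, the RMSE metric named above is the square root of the mean squared difference between true and predicted values. A minimal sketch with NumPy follows; the function name `rmse` is an illustrative choice, not part of the task statement.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the residuals, average them, then take the square root.
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Perfect predictions give an error of zero.
perfect = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
# Residuals of 3 and 4 give sqrt((9 + 16) / 2).
off = rmse([0.0, 0.0], [3.0, 4.0])
```

Lower values are better, and the result is expressed in the same units as the target.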



---

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

## Exploratory Data Analysis

---



In [3]:
train_df = pd.read_csv("./data/train.csv")
train_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,target
0,236,488,16,221,382,97,-4.472136,0.107472,0,132,...,13.340874,0.870542,1.962937,7.466666,11.547794,8.822916,9.046424,7.895535,11.010677,20.107472
1,386,206,357,232,1,198,7.81025,0.763713,1,143,...,12.484882,7.16868,2.885415,12.413973,10.260494,10.091351,9.270888,3.173994,13.921871,61.763713
2,429,49,481,111,111,146,8.602325,0.651162,1,430,...,14.030257,0.39497,8.160625,12.592059,8.937577,2.265191,11.255721,12.794841,12.080951,74.651162
3,414,350,481,370,208,158,8.306624,0.424645,1,340,...,2.789577,6.416708,10.549814,11.456437,6.468099,2.519049,0.258284,9.317696,5.383098,69.424645
4,318,359,20,218,317,301,8.124038,0.767304,1,212,...,1.88656,1.919999,2.268203,0.149421,4.105907,10.416291,6.816217,8.58696,4.512419,66.767304


In [4]:
train_df.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,target
count,90000.0,90000.0,90000.0,90000.0,90000.0,90000.0,90000.0,90000.0,90000.0,90000.0,...,90000.0,90000.0,90000.0,90000.0,90000.0,90000.0,90000.0,90000.0,90000.0,90000.0
mean,249.423944,250.236267,248.637289,249.7366,249.436178,249.656167,-0.011402,0.498548,0.499189,249.842033,...,7.475155,7.523962,7.508397,7.473322,7.490658,7.474578,7.509206,7.487159,7.513316,50.033981
std,144.336393,144.0222,144.107577,144.284945,143.941581,144.329168,7.038171,0.288682,0.500002,144.612718,...,4.33041,4.321537,4.331761,4.335692,4.332122,4.323035,4.326364,4.324876,4.33308,28.897243
min,0.0,0.0,0.0,0.0,0.0,0.0,-9.949874,1.4e-05,0.0,0.0,...,1.9e-05,4e-05,0.000154,8.3e-05,0.000367,1.4e-05,0.00016,0.000147,0.000125,0.002634
25%,125.0,126.0,124.0,125.0,125.0,124.0,-7.071068,0.248932,0.0,124.0,...,3.707544,3.797002,3.760627,3.715721,3.739358,3.715298,3.773381,3.743536,3.776322,25.091903
50%,250.0,251.0,248.0,250.0,250.0,250.0,0.0,0.497136,0.0,250.0,...,7.474127,7.533987,7.505259,7.459774,7.494167,7.47727,7.512575,7.476564,7.506812,50.030705
75%,374.0,375.0,374.0,375.0,373.0,374.0,7.0,0.747513,1.0,376.0,...,11.216585,11.276349,11.261971,11.215637,11.239232,11.21007,11.268156,11.234414,11.277835,75.059454
max,499.0,499.0,499.0,499.0,499.0,499.0,9.949874,0.999987,1.0,499.0,...,14.9999,14.999528,14.999733,14.999478,14.999869,14.999928,14.999948,14.999364,14.999775,99.999482


No duplicates were detected.
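
The notebook does not show the cell that produced this result. One common way to run such a check, sketched here on a tiny illustrative frame (in the notebook the frame would be `train_df`), is `DataFrame.duplicated`:

```python
import pandas as pd

# Tiny illustrative frame with one repeated row; not the notebook's data.
df = pd.DataFrame({"a": [1, 2, 2], "b": [0.5, 1.5, 1.5]})

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(df.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# If duplicates were present, they could be dropped like this.
deduplicated = df.drop_duplicates()
```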

At first glance, the data is anonymized, so we cannot tell what each column represents. However, we can check for nulls and inspect the ranges and types of the data.

All columns are numeric: some hold integers and others hold floats, and `target` is a continuous float. We will therefore use a regression model to make predictions.

The dataframe has no null values in any column, which is great.
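
The checks mentioned above (nulls, types, ranges) can be sketched as follows. This is a minimal example on a three-row excerpt of columns `0`, `6` and `target` taken from the `head()` output; in the notebook the same calls would be applied to the full `train_df`.

```python
import pandas as pd

# Three rows copied from the head() output above, for illustration only.
df = pd.DataFrame(
    {
        "0": [236, 386, 429],
        "6": [-4.472136, 7.81025, 8.602325],
        "target": [20.107472, 61.763713, 74.651162],
    }
)

# Number of missing values per column.
null_counts = df.isna().sum()
# Data type of each column.
dtypes = df.dtypes
# Minimum and maximum of each column.
ranges = df.agg(["min", "max"])

print(null_counts, dtypes, ranges, sep="\n\n")
```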

## Detecting important variables

---

In [None]:
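from sklearn.feature_selection import mutual_info_regression
# Every column except the target is a candidate feature.
features = train_df.drop(columns=["target"]).columns
# Estimate the mutual information between each feature and the target.
mi = mutual_info_regression(train_df[features], train_df["target"])
# Map each feature name to its mutual information score.
feature_importance = dict(zip(features, mi))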
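
`mutual_info_regression` adds a small amount of random noise to continuous variables, so repeated runs can give slightly different scores. A minimal sketch on synthetic data (not the notebook's dataset) shows how fixing `random_state` makes the scores reproducible, and how an informative feature stands out from a pure-noise feature. The arrays and the seed value are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
informative = rng.uniform(-10, 10, size=500)  # feature that drives the target
noise = rng.uniform(0, 15, size=500)          # feature unrelated to the target
X = np.column_stack([informative, noise])
y = informative ** 2                          # synthetic, nonlinear target

# With a fixed random_state, two runs return identical scores.
mi_first = mutual_info_regression(X, y, random_state=0)
mi_second = mutual_info_regression(X, y, random_state=0)

print(dict(zip(["informative", "noise"], mi_first)))
```

Mutual information is useful here because it also captures nonlinear dependence, which a plain correlation coefficient could miss.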

### Display variable importances

---

In [15]:
feature_importance

{'0': 0.0019286351553189363,
 '1': 0.001142874737396049,
 '2': 0.0033978981421478593,
 '3': 0.0018018666454384658,
 '4': 0.0,
 '5': 0.0011415190751726811,
 '6': 4.6021369633572276,
 '7': 1.2742208914395654,
 '8': 0.0,
 '9': 0.0,
 '10': 0.0001272749840719456,
 '11': 0.0,
 '12': 0.007266029623217207,
 '13': 0.0,
 '14': 0.00017680496425409586,
 '15': 0.0,
 '16': 0.001606221688967402,
 '17': 0.0,
 '18': 0.0009014550478099181,
 '19': 0.0,
 '20': 0.0,
 '21': 0.0,
 '22': 0.00453739861051794,
 '23': 0.004286698300493441,
 '24': 0.0,
 '25': 0.0,
 '26': 0.0006654382879300869,
 '27': 0.0,
 '28': 0.0033465406915196283,
 '29': 0.005362429651777134,
 '30': 0.00014615863179212596,
 '31': 0.0,
 '32': 0.0004066727527600733,
 '33': 0.0033223602877603398,
 '34': 0.0,
 '35': 0.00038673334042282903,
 '36': 0.0,
 '37': 0.0002573104842742424,
 '38': 0.0,
 '39': 0.0,
 '40': 0.0009556730597317653,
 '41': 0.000541586814799544,
 '42': 0.0,
 '43': 0.0,
 '44': 0.0,
 '45': 0.0013008841476764843,
 '46': 0.0,
 '47': 

## Sorting features by importance

---

In [21]:
sorted(feature_importance.items(), key=lambda x: x[1], reverse=True)

[('6', 4.6021369633572276),
 ('7', 1.2742208914395654),
 ('12', 0.007266029623217207),
 ('29', 0.005362429651777134),
 ('22', 0.00453739861051794),
 ('23', 0.004286698300493441),
 ('2', 0.0033978981421478593),
 ('28', 0.0033465406915196283),
 ('33', 0.0033223602877603398),
 ('47', 0.001963503885605178),
 ('0', 0.0019286351553189363),
 ('3', 0.0018018666454384658),
 ('48', 0.0016693908347633624),
 ('16', 0.001606221688967402),
 ('45', 0.0013008841476764843),
 ('1', 0.001142874737396049),
 ('5', 0.0011415190751726811),
 ('40', 0.0009556730597317653),
 ('18', 0.0009014550478099181),
 ('26', 0.0006654382879300869),
 ('41', 0.000541586814799544),
 ('32', 0.0004066727527600733),
 ('35', 0.00038673334042282903),
 ('37', 0.0002573104842742424),
 ('14', 0.00017680496425409586),
 ('30', 0.00014615863179212596),
 ('10', 0.0001272749840719456),
 ('4', 0.0),
 ('8', 0.0),
 ('9', 0.0),
 ('11', 0.0),
 ('13', 0.0),
 ('15', 0.0),
 ('17', 0.0),
 ('19', 0.0),
 ('20', 0.0),
 ('21', 0.0),
 ('24', 0.0),
 ('2

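Features '6' and '7' have mutual information scores of about 4.60 and 1.27, while every other feature shown scores below 0.01. We will therefore use only features '6' and '7'.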
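
The selection above can be expressed as a simple threshold on the scores. A minimal sketch follows, using a handful of the scores listed above and a hypothetical cut-off of 0.1; the notebook does not state a threshold, so the value is an assumption.

```python
# A few of the mutual information scores from the sorted output above.
feature_importance = {
    "6": 4.6021369633572276,
    "7": 1.2742208914395654,
    "12": 0.007266029623217207,
    "29": 0.005362429651777134,
    "4": 0.0,
}

# Hypothetical cut-off: keep features that score clearly above the rest.
MI_THRESHOLD = 0.1

selected_features = [
    name for name, score in feature_importance.items() if score > MI_THRESHOLD
]
print(selected_features)
```

In the notebook, `train_df[selected_features]` would then serve as the model input.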