# Goal
The goal of this notebook is to collect small snippets that come up repeatedly in common ML and DS tasks.

## <font color = 'red'>NOTE </font> 
Feature transformations are covered in a separate folder.

In [1]:
import pandas as pd
import numpy as np

import snowflake.connector
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
planets = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/planets.csv')

In [3]:
planets.head()

Unnamed: 0,method,number,orbital_period,mass,distance,year
0,Radial Velocity,1,269.3,7.1,77.4,2006
1,Radial Velocity,1,874.774,2.21,56.95,2008
2,Radial Velocity,1,763.0,2.6,19.84,2011
3,Radial Velocity,1,326.03,19.4,110.62,2007
4,Radial Velocity,1,516.22,10.5,119.47,2009


<br/>


### Separating numerical and categorical columns 
Always look at the data before doing this: categories might already have been encoded, with a legend documented somewhere.
You may have to pick the unique-count threshold after a glimpse at the data, and you should also check the dtype of each column. This is not a hard-and-fast rule; it will vary with the dataset.

In [4]:
planets.nunique()

method             10
number              7
orbital_period    988
mass              381
distance          552
year               23
dtype: int64

In [6]:
planets.dtypes

method             object
number              int64
orbital_period    float64
mass              float64
distance          float64
year                int64
dtype: object

In [39]:
# Approach 1: treat columns with few unique values as categorical candidates.
# Be careful with the threshold: it should usually be small for categorical columns.
condition = planets.nunique() <= 10
check = planets.nunique()[condition].keys().tolist()
check

['method', 'number']

In [37]:
# Approach 2: select categorical columns by dtype
categorical = planets.select_dtypes(include=["bool", "object", "category"]).columns
categorical

Index(['method'], dtype='object')

In [38]:
# Numerical columns: everything not selected as categorical by dtype
numerical = [col for col in planets.columns if col not in categorical]
numerical

['number', 'orbital_period', 'mass', 'distance', 'year']
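The two checks above (unique-value count and dtype) can be combined into one helper. The following is only a sketch: the function name `split_columns`, the `max_unique` parameter, and the toy frame are hypothetical and not part of the original notebook, and the dtype test uses `pandas.api.types` rather than `select_dtypes`.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def split_columns(df, max_unique=10):
    """Hypothetical helper: return (categorical, numerical) column name lists.

    A column counts as categorical if it is boolean or non-numeric, or if it
    is numeric but has at most `max_unique` distinct values.
    """
    categorical, numerical = [], []
    for col in df.columns:
        non_numeric = is_bool_dtype(df[col]) or not is_numeric_dtype(df[col])
        low_cardinality = df[col].nunique() <= max_unique
        if non_numeric or low_cardinality:
            categorical.append(col)
        else:
            numerical.append(col)
    return categorical, numerical


# Toy frame (made up for illustration, loosely shaped like the planets data)
toy = pd.DataFrame({
    "method": ["Radial Velocity", "Transit", "Transit", "Imaging"],
    "number": [1, 2, 1, 3],
    "mass": [7.1, 2.21, 2.6, 19.4],
})

# The threshold is deliberately tiny because the toy frame has only 4 rows
categorical, numerical = split_columns(toy, max_unique=3)
print(categorical, numerical)
```

The threshold still has to be chosen per dataset, and the result should be checked by eye: a low-cardinality integer column such as `number` may be a genuine count rather than a category.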

<br/>
<br/>
<br/>

### Missing Data Treatment

In [48]:
planets.isnull().sum()

method              0
number              0
orbital_period     43
mass              522
distance          227
year                0
dtype: int64
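Once the missing counts are known, a treatment has to be chosen. The sketch below shows two common options, median imputation and dropping rows, on a made-up toy frame; the values and the choice of median are assumptions for illustration, not a recommendation from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (made up for illustration) with gaps in two numeric columns
toy = pd.DataFrame({
    "orbital_period": [269.3, np.nan, 763.0, 326.03],
    "mass": [7.1, 2.21, np.nan, np.nan],
    "year": [2006, 2008, 2011, 2007],
})

# Share of missing values per column, often easier to judge than raw counts
missing_share = toy.isnull().mean()

# Option 1: fill each numeric column with its own median
medians = toy.median(numeric_only=True)
filled = toy.fillna(medians)

# Option 2: drop the rows where a specific column is missing
dropped = toy.dropna(subset=["mass"])

print(missing_share)
print(filled)
print(dropped)
```

Which option fits depends on how much is missing and why. For a column like `mass` in the planets data, where roughly half of the values are missing, imputing a single value would distort the distribution considerably.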

<br/>
<br/>
<br/>

### Balancing datasets

In [114]:
# Imbalanced binary target: 16 zeros and 4 ones (a 4:1 ratio)
y = np.concatenate((np.repeat(0, 16), np.repeat(1, 4)))

# Feature: a simple sequence 0..19
x = np.arange(0, 20)

In [115]:
im_df = pd.DataFrame({"x": x, "y": y})

In [125]:
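# Class ratio in the full data: 16:4, i.e. 4:1
#
# 80/20 split       train (80%)    test (20%)
#   rows            16             4
#   target ratio    4:1            4:1
#   class 0         13             3
#   class 1         3              1
#
# 70/30 split       train (70%)    test (30%)
#   rows            14             6
#   target ratio    4:1            4:1
#   class 0         11             5
#   class 1         3              1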
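The arithmetic in the comments above can be reproduced with a small helper that computes the expected (unrounded) number of rows per class when the 4:1 ratio is preserved. The function name `expected_class_counts` is hypothetical, and the rounding of the test-set size is a simplification that is not guaranteed to match scikit-learn's internal rule in every case.

```python
def expected_class_counts(class_counts, test_size):
    """Hypothetical helper: expected rows per class in train and test parts.

    `class_counts` maps each class label to its row count in the full data.
    The values returned are unrounded; a real split has to round them.
    """
    n_total = sum(class_counts.values())
    n_test = round(n_total * test_size)  # simplification, see lead-in
    n_train = n_total - n_test
    train = {label: n_train * count / n_total for label, count in class_counts.items()}
    test = {label: n_test * count / n_total for label, count in class_counts.items()}
    return train, test


# 16 zeros and 4 ones, as in the example data
train_80, test_20 = expected_class_counts({0: 16, 1: 4}, test_size=0.2)
train_70, test_30 = expected_class_counts({0: 16, 1: 4}, test_size=0.3)

print(train_80, test_20)  # about 12.8 / 3.2 in train, 3.2 / 0.8 in test
print(train_70, test_30)  # about 11.2 / 2.8 in train, 4.8 / 1.2 in test
```

Rounding these values to whole rows gives the 13/3 and 3/1 (80/20) and 11/3 and 5/1 (70/30) counts listed in the comments.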

In [64]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit  # cross-validator that yields stratified random splits

In [133]:
# Plain random 70/30 split: class proportions are not guaranteed to be preserved
train_X, test_X, train_y, test_y = train_test_split(im_df.x, im_df.y, test_size = 0.3, random_state = 42)

In [134]:
train_y.value_counts()

0    11
1     3
Name: y, dtype: int64

In [135]:
test_y.value_counts()

0    5
1    1
Name: y, dtype: int64

In [130]:
# Stratified 80/20 split: stratify = im_df.y keeps the class ratio close to 4:1 in both parts
train_X_strat, test_X_strat, train_y_strat, test_y_strat = train_test_split(im_df.x, im_df.y, test_size = 0.2, random_state = 42, stratify = im_df.y)

In [131]:
train_y_strat.value_counts()

0    13
1     3
Name: y, dtype: int64

In [132]:
test_y_strat.value_counts()

0    3
1    1
Name: y, dtype: int64
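`StratifiedShuffleSplit` is imported above but not used. As a sketch of how it might be applied to the same toy data, the block below draws several stratified 80/20 splits and records the class counts in each test part. The choice of `n_splits=3` and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Same toy data as above: 16 zeros and 4 ones
y = np.concatenate((np.repeat(0, 16), np.repeat(1, 4)))
x = np.arange(0, 20)

# Three independent stratified 80/20 splits (n_splits is an arbitrary choice)
splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=42)

test_counts = []
for train_idx, test_idx in splitter.split(x.reshape(-1, 1), y):
    # Count how many rows of class 0 and class 1 landed in the test part
    test_counts.append(np.bincount(y[test_idx], minlength=2).tolist())

print(test_counts)
```

Unlike a single `train_test_split` call, the cross-validator yields index arrays for each split, so it can be passed as the `cv` argument to utilities such as `cross_val_score`.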