### 3.2 Data preparation

`df.head().T` - take a look at the first rows of the dataframe, transposed  
`df.columns` - retrieve column names of a dataframe  
`df.dtypes` - retrieve data types of all series   
`df.index` - retrieve the index of a dataframe  
`pd.to_numeric()` - convert the values of a series to numbers. With the `errors='coerce'` argument, values that cannot be parsed become NaN instead of raising an error.  
`(df.x == "yes").astype(int)` - convert the yes/no values of the x series to 1/0.  
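
A minimal sketch of the conversions above on a toy dataframe. The column names (`totalcharges`, `churn`) and the choice to fill the resulting NaN with 0 are assumptions made for illustration, not something these notes prescribe.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "totalcharges": ["29.85", "_", "108.15"],
    "churn": ["no", "yes", "no"],
})

# Strings that cannot be parsed (here "_") become NaN instead of raising an error.
df["totalcharges"] = pd.to_numeric(df["totalcharges"], errors="coerce")

# Filling NaN with 0 is one possible choice (an assumption of this sketch).
df["totalcharges"] = df["totalcharges"].fillna(0)

# Boolean comparison, then cast: "yes" -> 1, anything else -> 0.
df["churn"] = (df["churn"] == "yes").astype(int)

print(df.dtypes)
```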




### 3.3 Setting up the validation framework

`train_test_split` - Scikit-Learn function for splitting a dataset into two parts. The `random_state` argument sets the random seed for reproducibility.  
`df.reset_index(drop=True)` - reset the index of a dataframe and discard the old one (returns a new dataframe).  
`df.x.values` - extract the values from x series  
`del df['x']` - delete x series from a dataframe
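
The calls above can be combined into a train/validation/test split. The sketch below assumes a 60/20/20 split and a toy dataframe with a `churn` target; the proportions, the seed value, and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataframe with 100 rows.
df = pd.DataFrame({"feature": np.arange(100), "churn": np.arange(100) % 2})

# Hold out 20% for testing, then 25% of the remaining 80% (20% overall) for validation.
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

# Shuffling leaves the old index values in place; reset them and drop the old ones.
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

# Extract the target as NumPy arrays.
y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

# Remove the target from the feature dataframes so it cannot leak into training.
del df_train["churn"]
del df_val["churn"]
del df_test["churn"]

print(len(df_train), len(df_val), len(df_test))
```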

### 3.4 EDA
1. Check NaN values:   
  `.isnull().sum()`  
2. Count the number of each value in column churn:  
  `.churn.value_counts()`  
  `.churn.value_counts(normalize=True)`  
3. Find numerical and categorical columns:  
  `.dtypes`  
  `categorical = [], numerical = []`   
  `df[categorical].nunique()`  
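
The three checks above might look like this on a small example. The dataframe and the contents of `numerical` and `categorical` are made up for illustration; in practice the lists are filled in by hand after inspecting `.dtypes`.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "gender": ["female", "male", "male", "female"],
    "contract": ["monthly", "yearly", "monthly", "monthly"],
    "tenure": [1, 34, 2, 45],
    "churn": [1, 0, 1, 0],
})

# 1. Number of missing values per column.
missing = df.isnull().sum()

# 2. Share of each value in the target column (normalize=True gives fractions).
churn_rate = df.churn.value_counts(normalize=True)

# 3. Inspect the dtypes, then list the columns of each kind by hand.
print(df.dtypes)
numerical = ["tenure"]
categorical = ["gender", "contract"]

# Number of distinct values in each categorical column.
n_unique = df[categorical].nunique()
print(n_unique)
```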

### 3.5 Feature importance

1. ***Difference between global and group*** churn: global - group  
2. ***Risk ratio***: group / global   

`df.groupby('gender').churn.mean()`   
To get a dataframe instead:  
`df.groupby('gender').churn.agg(['mean', 'count'])`  

3. For each column in `categorical`, do the same thing: calculate diff, risk, mean and count.
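
A sketch of the loop described in step 3, using the definitions above (diff = global - group, risk = group / global). The toy data and the `results` dictionary are assumptions added for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "partner": ["yes", "no", "no", "no"],
    "churn": [1, 0, 1, 1],
})
categorical = ["gender", "partner"]

# Churn rate over the whole dataset.
global_churn = df.churn.mean()

results = {}
for col in categorical:
    # Churn rate and group size for each value of the column.
    df_group = df.groupby(col).churn.agg(["mean", "count"])
    # Difference: global - group (positive means the group churns less than average).
    df_group["diff"] = global_churn - df_group["mean"]
    # Risk ratio: group / global (above 1 means the group churns more than average).
    df_group["risk"] = df_group["mean"] / global_churn
    results[col] = df_group
    print(col)
    print(df_group)
```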




### 3.6 Feature Importance: Mutual Information (categorical)

*Intuition*: how much do we learn about churn by observing another variable?  

`mutual_info_score(x, y)` - Scikit-Learn function for calculating the mutual information between x and y, here the target variable and a feature. The score is symmetric, so the argument order does not matter.  
`df[x].apply(y)` - apply the function y to the x series of the df dataframe.  
`s.sort_values(ascending=False).to_frame(name='x')` - sort the values of a series `s` in descending order and convert it to a dataframe whose column is named x.
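
The three calls above can be chained to rank categorical features by mutual information with the target. This is a sketch on invented data; the helper name `mutual_info_churn_score` and the column name `MI` are assumptions.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Hypothetical toy data: contract fully determines churn, gender tells us nothing.
df = pd.DataFrame({
    "contract": ["monthly", "monthly", "yearly", "yearly"],
    "gender": ["female", "male", "female", "male"],
    "churn": [1, 1, 0, 0],
})
categorical = ["contract", "gender"]


def mutual_info_churn_score(series):
    """Mutual information between one feature column and the churn target."""
    return mutual_info_score(series, df.churn)


# Applying a function to a dataframe calls it once per column and returns a series.
mi = df[categorical].apply(mutual_info_churn_score)

# Most informative features first, as a one-column dataframe.
mi_frame = mi.sort_values(ascending=False).to_frame(name="MI")
print(mi_frame)
```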





### 3.7 Feature importance: Correlation (numerical)

$-1 \leq r \leq 1$  
`df[numerical].corrwith(df.churn)`  
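
A small sketch of `corrwith` on invented data: a negative coefficient means the feature tends to decrease as churn increases, a positive one means it tends to increase. The column names and values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "tenure": [1, 2, 30, 40],
    "monthlycharges": [80.0, 90.0, 20.0, 30.0],
    "churn": [1, 1, 0, 0],
})
numerical = ["tenure", "monthlycharges"]

# Pearson correlation of each numerical column with the target.
corr = df[numerical].corrwith(df.churn)
print(corr)

# The sign gives the direction, the absolute value gives the strength.
print(corr.abs().sort_values(ascending=False))
```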


### 3.8 One-hot encoding (categorical)

To check - Compressed Sparse Row format  
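```
from sklearn.feature_extraction import DictVectorizer

# Turn the first 10 rows of the selected columns into a list of dicts
dicts = df[[..., ...]].iloc[:10].to_dict(orient='records')

dv = DictVectorizer()         # returns a sparse matrix by default
dv.fit(dicts)                 # learn the feature names from the dicts
dv.transform(dicts)           # encode the dicts as a matrix
dv.get_feature_names_out()    # get_feature_names() was removed in scikit-learn 1.2
```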
If we give it numerical values, it leaves them unchanged and only one-hot encodes the string values. So we can encode the whole dataframe at once, without first extracting the categorical columns:
```
train_dicts = df[numerical + categorical].to_dict(orient='records')
dv = DictVectorizer()

# Step by step:
dv.fit(train_dicts)
dv.transform(train_dicts[:5]).toarray()[0]   # first encoded row, to see what it looks like
dv.get_feature_names_out()

# Or, instead of the three lines above, fit and transform in one call:
X_train = dv.fit_transform(train_dicts)
```


### 3.9 Logistic regression
***Linear models:***
- Linear Regression: $$g(x_i) = w_0 + w^Tx_i \in (-\infty; +\infty)$$
- Logistic Regression (classification): $$g(x_i) = \sigma(w_0 + w^Tx_i) \in (0; 1)$$

Sigmoid:
$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$  
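
The two formulas translate almost directly into NumPy. The weights and the feature vector below are arbitrary numbers chosen for illustration, not trained values.

```python
import numpy as np


def sigmoid(z):
    """Map any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def logistic_regression(x_i, w0, w):
    """Linear score w0 + w^T x_i passed through the sigmoid."""
    score = w0 + np.dot(w, x_i)
    return sigmoid(score)


# Arbitrary illustrative values (assumptions, not fitted weights).
w0 = -1.0
w = np.array([0.5, -0.25])
x_i = np.array([2.0, 4.0])

# score = -1.0 + 0.5 * 2.0 - 0.25 * 4.0 = -1.0
p = logistic_regression(x_i, w0, w)
print(p)
```

A score of 0 maps to a probability of exactly 0.5, negative scores map below 0.5, and positive scores map above it.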




### 3.10 Training logistic regression with Scikit-Learn


`model = LogisticRegression().fit(X, y)` - train the Scikit-Learn logistic regression model on the feature matrix X and the target y (`fit` returns the fitted model).  
`model.coef_[0]` - returns the coefficients or weights of the LR model  
`model.intercept_[0]` - returns the bias or intercept of the LR model  
`model.predict(x)` - make hard predictions (class labels) on the x dataset  
`model.predict_proba(x)` - make soft predictions on the x dataset: returns two columns with the probabilities of the two classes
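
A minimal end-to-end sketch of these calls. The one-feature dataset is invented, and the 0.5 decision threshold is an assumption of this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical, well-separated toy data: small feature values churn, large ones do not.
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_train = np.array([1, 1, 1, 0, 0, 0])

model = LogisticRegression()
model.fit(X_train, y_train)

w = model.coef_[0]        # weights, one per feature
w0 = model.intercept_[0]  # bias term

# Hard predictions: class labels.
hard = model.predict(X_train)

# Soft predictions: column 0 is P(class 0), column 1 is P(class 1).
soft = model.predict_proba(X_train)
churn_prob = soft[:, 1]

# Turn probabilities into decisions with a threshold (0.5 assumed here).
churn_decision = churn_prob >= 0.5

print(w0, w)
print(churn_prob.round(3))
```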



### 3.11 Model Interpretation

