## Simple Linear Regression

Simple linear regression is a linear regression model with a single explanatory variable.

That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variable. 

The adjective simple refers to the fact that the outcome variable is related to a single predictor. (Wikipedia)

![Simple Linear Regression](./img/lin_reg_3.png)

The equation of the above line is:

```
y = mx + b
```

Where b is the intercept and m is the slope of the line. The linear regression algorithm gives us the optimal values for the intercept and the slope (in two dimensions).

The x and y values stay fixed, since they are the observed data and cannot be changed. The values that we can control are the intercept (b) and the slope (m).

Different values of the intercept and slope give different straight lines. Conceptually, the linear regression algorithm compares candidate lines through the data points and returns the line that results in the least error.

![Fitting Linear Regression](./img/fit_lin_reg.gif)
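To make "the line with the least error" concrete, here is a minimal sketch that scores a few candidate lines by their sum of squared errors. The data points and the candidate (slope, intercept) pairs are made up for illustration; a real fit does not try lines one by one like this.

```python
import numpy as np

# Toy data (made up for illustration): roughly y = 2x + 1 with small noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])


def sum_squared_error(m, b):
    """Sum of squared vertical distances between the data and the line y = m*x + b."""
    residuals = y - (m * x + b)
    return float(np.sum(residuals ** 2))


# A few hypothetical candidate lines, given as (slope, intercept) pairs
candidates = [(1.0, 0.0), (2.0, 1.0), (3.0, -1.0)]
errors = {line: sum_squared_error(*line) for line in candidates}

# The candidate with the smallest error is the best of the three
best_line = min(errors, key=errors.get)

for (m, b), err in errors.items():
    print(f"m={m}, b={b}: SSE={err:.2f}")
print("best candidate:", best_line)
```

Among these three candidates, the line with slope 2 and intercept 1 has the smallest error, which matches how the toy data was constructed.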

### Ordinary Least Squares and Regression Errors

Ordinary least squares (OLS) is used for estimating the unknown parameters in a linear regression model. 

OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the given dataset and those predicted by the linear function of the independent variable. (Wikipedia)

Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface—the smaller the differences, the better the model fits the data.

Simple linear regression is based on the idea that the relationship between two variables can be explained by the following formula:

```
yi = αxi + β + εi
```

Where εi is the error term of the i-th observation, and α, β are the true (but unobserved) parameters of the regression.

![Error in Linear Regression](./img/error_lin_reg.png)

The idea of simple linear regression is to find the values of α and β for which the errors are as small as possible (this is what happens during a fit).

To be more precise, the model minimizes the sum of the squared errors: we do not want positive errors to be cancelled out by negative ones, since both are equally penalizing for our model.

![alpha and beta to be calculated](./img/ols_img.png)

*This procedure is called Ordinary Least Squares (OLS).*

Once we have obtained the values of α and β that minimize the squared errors, our model's equation looks like this:

![alpha and beta calculated](./img/ols_img_2.png)

In summary, you can think of OLS as a strategy to obtain, from your model, a 'straight line' that is as close as possible to your data points.

Even though OLS is not the only optimization strategy, it is the most popular for this kind of task, since the outputs of the regression (that is, the coefficients) are unbiased estimators of the true values of α and β under the standard linear model assumptions.
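For a single predictor, the least-squares estimates have a well-known closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below applies it to made-up data, with α as the slope and β as the intercept as in the formula y = αx + β, and cross-checks the result against `np.polyfit`. The variable names are chosen for this example only.

```python
import numpy as np

# Toy data (made up for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

x_mean, y_mean = x.mean(), y.mean()

# Slope: sum of cross-deviations divided by the sum of squared deviations of x
alpha_hat = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
# Intercept: the fitted line passes through the point (x_mean, y_mean)
beta_hat = y_mean - alpha_hat * x_mean

# Cross-check with NumPy's degree-1 polynomial fit (returns slope first)
slope, intercept = np.polyfit(x, y, deg=1)

print(f"closed form: alpha={alpha_hat:.2f}, beta={beta_hat:.2f}")
print(f"np.polyfit:  slope={slope:.2f}, intercept={intercept:.2f}")
```

Both approaches give a slope of about 1.96 and an intercept of about 1.10 on this data.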

#### Data Processing

Data processing is the task of converting data from a given form into a more usable and desired form, i.e. making it more meaningful and informative.

Data preparation plays an important role in your workflow. You need to transform the data so that a computer can work with it.

- Data quality assessment
First of all, you need to have a good look at your database and perform a data quality assessment. A random collection of data often has irrelevant bits. Here are some examples.

- Mismatching in data types
Quite often, you might mix together datasets that use different data formats. Hence the mismatches: integer vs. float, or UTF-8 vs. ASCII.

- Different dimensions of data arrays
When you aggregate data from different datasets, for example, from five different arrays of data for voice recognition, three fields that are present in one of them can be missing in four other arrays.

- Mixture of data values
For example, a gender field having two different values for women: woman and female.

- Outliers in the dataset

- Missing data
You may also notice that some important values are missing. These problems arise due to the human factor, program errors, or other reasons. They will affect the accuracy of the predictions, so before going any further with your database, you need to do data cleaning.

- Data cleaning
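Some of the checks above can be sketched with pandas. The dataframe, its column names, and the 1.5 × IQR outlier rule are assumptions made for this illustration; the right checks depend on your own data.

```python
import pandas as pd

# Made-up records that show several of the issues listed above
df = pd.DataFrame({
    "age": ["25", "31", None, "47", "29"],                   # numbers stored as text, one missing
    "gender": ["woman", "female", "man", "female", "man"],   # two labels for one category
    "income": [42_000, 39_500, 41_000, 1_000_000, 40_200],   # one extreme value
})

# Missing data: count the missing values in each column
missing_per_column = df.isna().sum()

# Mismatching data types: convert the text column to numbers
df["age"] = pd.to_numeric(df["age"])

# Mixture of data values: map both labels onto a single one
df["gender"] = df["gender"].replace({"woman": "female"})

# Outliers: flag values far outside the interquartile range (a common rule of thumb)
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)

print(missing_per_column)
print(df[is_outlier])
```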

#### Encoding for categorical data

```
Machine learning and deep learning models all require input and output variables to be numeric.
```

A categorical variable is a variable whose values are labels rather than numbers.

For example, the variable may be “color” and may take on the values “red,” “green,” and “blue.”

Sometimes, the categorical data may have an ordered relationship between the categories, such as “first,” “second,” and “third.” 
This type of categorical data is referred to as ordinal and the additional ordering information can be useful!

#### Label Encoding

In label encoding, each label is converted into an integer value.
For example, a variable holding the education qualification of a person could be encoded this way.

This is also called ordinal encoding: each unique category value is assigned an integer value.

For example, “red” is 1, “green” is 2, and “blue” is 3.
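A minimal sketch of this mapping with pandas, assuming a made-up `color` column and the red/green/blue codes from the example above:

```python
import pandas as pd

# Made-up column of color labels
colors = pd.Series(["red", "green", "blue", "green", "red"], name="color")

# Explicit mapping from each category to its integer code
color_codes = {"red": 1, "green": 2, "blue": 3}

encoded = colors.map(color_codes)
print(encoded.tolist())  # [1, 2, 3, 2, 1]
```

Writing the mapping by hand keeps the codes under your control. Library encoders usually pick the codes themselves; scikit-learn's `LabelEncoder`, for instance, sorts the labels and starts counting from 0.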


#### One-Hot Encoding
For categorical variables where no ordinal relationship exists, integer encoding is insufficient at best and misleading to the model at worst.

Forcing an ordinal relationship via an ordinal encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

Each category is mapped with a binary variable containing either 0 or 1. Here, 0 represents the absence, and 1 represents the presence of that category.

These newly created binary features are known as Dummy variables.

Note: the examples below need the `category_encoders` package, which can be installed with `!pip install category_encoders`.

```python
import category_encoders as ce
import pandas as pd

data = pd.DataFrame({'City': [
    'Delhi', 'Mumbai', 'Hydrabad', 'Chennai', 'Bangalore', 'Delhi', 'Hydrabad', 'Bangalore', 'Delhi'
]})

# Create the one-hot encoder; use_cat_names=True names the new columns after the categories
encoder = ce.OneHotEncoder(cols=['City'], handle_unknown='return_nan', return_df=True, use_cat_names=True)

# Original data
data
```

&nbsp; | City
---|---
0 | Delhi
1 | Mumbai
2 | Hydrabad
3 | Chennai
4 | Bangalore
5 | Delhi
6 | Hydrabad
7 | Bangalore
8 | Delhi


```python
# Fit the encoder and transform the data
data_encoded = encoder.fit_transform(data)
data_encoded
```

&nbsp; | City_Delhi | City_Mumbai | City_Hydrabad | City_Chennai | City_Bangalore
---|---|---|---|---|---
0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0
1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0
2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0
3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0
4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0
5 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0
6 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0
7 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0
8 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0


### Dealing with categorical variables

One hot encode
- each category becomes its own binary column, set to 1 when the entry has that category and 0 otherwise
- adds one column per category, so features with many categories can run into the curse of dimensionality
- the feature is no longer represented by a single column

![one hot encoding](./img/one_hot.png)

Label encode
- each category is replaced by an integer: `0, 1, 2, 3`
- is an ordinal encoding, so it implies an order even if the feature is not ordinal

![Label Encoding](./img/label_encoding.png)

Mean encode
- replace each category with the average of the target over the training rows in that category
- could also use other statistics like median, quantiles or variance

Target encode
- each category is replaced with a blend of the posterior probability of the target given that category and the prior probability of the target over all the training data
- the encodings are computed from the training data only, never from the test data
- we usually save the target encodings obtained from the training data set and use the same encodings to encode features in the test data set
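The blend of prior and posterior can be sketched by hand with pandas. The weighting used here (category count against a smoothing constant `m`) is one common choice, picked as an assumption for this illustration; it is not necessarily the formula `category_encoders` uses internally. The data and column names are made up.

```python
import pandas as pd

# Made-up training data with a binary target
train = pd.DataFrame({
    "city":   ["A", "A", "A", "A", "B", "B", "C"],
    "target": [1,   1,   1,   0,   0,   0,   1],
})

m = 2.0                          # smoothing strength (assumed value)
prior = train["target"].mean()   # prior: target mean over all training rows

# Posterior: target mean and row count within each category
stats = train.groupby("city")["target"].agg(["mean", "count"])

# Blend: categories with few rows are pulled towards the prior
blend = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)

train["city_encoded"] = train["city"].map(blend)

# Test data reuses the encodings learned on the training data;
# categories never seen in training fall back to the prior
test = pd.DataFrame({"city": ["A", "C", "D"]})
test["city_encoded"] = test["city"].map(blend).fillna(prior)

print(blend)
print(test)
```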

BaseN Encoding
- In binary encoding, we convert the integer codes into binary, i.e. base 2.
- BaseN allows us to convert the integers using any base.
- ideal for columns with a large number of categories

```python
import category_encoders as ce
import pandas as pd
import numpy as np

# Load the cars dataset, using the first CSV column as the index
data = pd.read_csv('./data/cars.csv', index_col=0)
data.head()
```

&nbsp; | Foreign/Local Used | color | wheel drive | Automation | seat-make | price | description | make-year | manufacturer
---|---|---|---|---|---|---|---|---|---
0 | Foreign Used | Black | 4 | Automatic | Leather | 17500000 | 2014 Lexus LX | 2014 | Lexus
1 | Foreign Used | Black | 4 | Automatic | Leather | 13000000 | 2012 Toyota Sequoia | 2012 | Toyota
2 | Foreign Used | Blue | 4 | Automatic | Cloth | 6500000 | 2007 Toyota FJ CRUISER | 2007 | Toyota
3 | Foreign Used | Black | 4 | Automatic | Leather | 4700000 | 2005 Lexus GX | 2005 | Lexus
4 | Foreign Used | Grey | 4 | Automatic | Leather | 3800000 | 2005 Toyota 4-Runner | 2008 | Toyota


```python
# Count how often each color appears
data.color.value_counts()
```

```
Black         390
Silver        243
Grey          131
Red            84
White          81
Blue           76
Gold           57
Maroon         47
Dark Grey      32
Dark Blue      25
Dark Green     13
Green           8
Other           3
Name: color, dtype: int64
```

```python
# Label encoding the color column
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()  # instantiate the label encoder
data['color'] = le.fit_transform(data['color'])

# It is best to instantiate a new LabelEncoder for each column
```

```python
data.head()
```

&nbsp; | Foreign/Local Used | color | wheel drive | Automation | seat-make | price | description | make-year | manufacturer
---|---|---|---|---|---|---|---|---|---
0 | Foreign Used | 0 | 4 | Automatic | Leather | 17500000 | 2014 Lexus LX | 2014 | Lexus
1 | Foreign Used | 0 | 4 | Automatic | Leather | 13000000 | 2012 Toyota Sequoia | 2012 | Toyota
2 | Foreign Used | 1 | 4 | Automatic | Cloth | 6500000 | 2007 Toyota FJ CRUISER | 2007 | Toyota
3 | Foreign Used | 0 | 4 | Automatic | Leather | 4700000 | 2005 Lexus GX | 2005 | Lexus
4 | Foreign Used | 7 | 4 | Automatic | Leather | 3800000 | 2005 Toyota 4-Runner | 2008 | Toyota


```python
# One-hot encoding for the Foreign/Local Used column

# Create the OneHotEncoder object
ce_one = ce.OneHotEncoder(cols=['Foreign/Local Used'])

ce_one.fit_transform(data)
```

&nbsp; | Foreign/Local Used_1 | Foreign/Local Used_2 | color | wheel drive | Automation | seat-make | price | description | make-year | manufacturer
---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 0 | 0 | 4 | Automatic | Leather | 17500000 | 2014 Lexus LX | 2014 | Lexus
1 | 1 | 0 | 0 | 4 | Automatic | Leather | 13000000 | 2012 Toyota Sequoia | 2012 | Toyota
2 | 1 | 0 | 1 | 4 | Automatic | Cloth | 6500000 | 2007 Toyota FJ CRUISER | 2007 | Toyota
3 | 1 | 0 | 0 | 4 | Automatic | Leather | 4700000 | 2005 Lexus GX | 2005 | Lexus
4 | 1 | 0 | 7 | 4 | Automatic | Leather | 3800000 | 2005 Toyota 4-Runner | 2008 | Toyota
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
286 | 1 | 0 | 4 | 4 | Automatic | Leather | 10000000 | 2012 LR4 | 2012 | LR4
287 | 1 | 0 | 0 | 4 | Automatic | Leather | 3500000 | 2012 Toyota Camry | 2012 | Toyota
288 | 1 | 0 | 10 | 4 | Automatic | Leather | 6500000 | 2010 Lexus RX | 2010 | Lexus
289 | 1 | 0 | 8 | 4 | Automatic | Leather | 8700000 | 2013 Toyota Highlander | 2013 | Toyota


```python
# get_dummies converts a categorical variable into dummy/indicator variables
pd.get_dummies(data['Foreign/Local Used']).head(10)
```

&nbsp; | Foreign Used | Locally Used
---|---|---
0 | 1 | 0
1 | 1 | 0
2 | 1 | 0
3 | 1 | 0
4 | 1 | 0
5 | 1 | 0
6 | 1 | 0
7 | 1 | 0
8 | 1 | 0
9 | 1 | 0


```python
# Target encoding

# Create the TargetEncoder object
ce_te = ce.TargetEncoder(cols=['seat-make'])

# Column to encode (X) and target column (y)
X = data['seat-make']
y = data['color']

# Fit the encoder on the data
ce_te.fit(X, y)

ce_te.transform(X).head()
```

&nbsp; | seat-make
---|---
0 | 5.253261
1 | 5.253261
2 | 5.918519
3 | 5.253261
4 | 5.253261


```python
# Make some data
example_df = pd.DataFrame({
    'class': ['a', 'b', 'a', 'b', 'd', 'e', 'd', 'f', 'g', 'h', 'h', 'k', 'h', 'i', 's', 'p', 'z']})

# Create the BaseNEncoder object (base 3)
ce_base3 = ce.BaseNEncoder(cols=['class'], base=3)

# Fit and transform to get the encoded data
ce_base3.fit_transform(example_df).head(10)
```

&nbsp; | class_0 | class_1 | class_2 | class_3
---|---|---|---|---
0 | 0 | 0 | 0 | 1
1 | 0 | 0 | 0 | 2
2 | 0 | 0 | 0 | 1
3 | 0 | 0 | 0 | 2
4 | 0 | 0 | 1 | 0
5 | 0 | 0 | 1 | 1
6 | 0 | 0 | 1 | 0
7 | 0 | 0 | 1 | 2
8 | 0 | 0 | 2 | 0
9 | 0 | 0 | 2 | 1


```python
# Mean encode
def mean_encode(data, col, on):
    """Replace each category in `col` with the mean of `on` for that category (modifies `data`)."""
    # Average of the target column within each category
    means = data.groupby(col)[on].mean()
    encoded = data[col].map(means)

    # Categories without a mean fall back to the overall average of the encoded values
    data[col] = encoded.fillna(encoded.mean())

    return data
```


```python
# Example dataframe 1
store1 = pd.DataFrame({'store': ['A'] * 3,
                       'Sales': [100, 200, 300],
                       'noise': [0, 0, 0]})

# Example dataframe 2
store2 = pd.DataFrame({'store': ['B'] * 3,
                       'Sales': [10, 20, 30],
                       'noise': [0, 0, 0]})

data = pd.concat([store1, store2], axis=0)  # stack the two dataframes
# After mean encoding, the store column should hold these values:
# np.testing.assert_array_equal(data.loc[:, 'store'], np.array([200, 200, 200, 20, 20, 20]))
```

```python
data
```

&nbsp; | store | Sales | noise
---|---|---|---
0 | A | 100 | 0
1 | A | 200 | 0
2 | A | 300 | 0
0 | B | 10 | 0
1 | B | 20 | 0
2 | B | 30 | 0


```python
mean_encode(data, col='store', on='Sales')
```

&nbsp; | store | Sales | noise
---|---|---|---
0 | 200 | 100 | 0
1 | 200 | 200 | 0
2 | 200 | 300 | 0
0 | 20 | 10 | 0
1 | 20 | 20 | 0
2 | 20 | 30 | 0


#### Logistic Regression

Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.

Logistic Regression is used for ***classification***, even though it's called ***regression***.

Therefore, it works on ***categorical labels***, namely, 0 and 1 for ***binary classification***. 

The ***Logistic Regression*** is a model that makes predictions in the [0, 1] interval, denoting ***probabilities***. Labels of the ***negative class*** are associated with 0, while labels of the ***positive class*** are associated with 1. So, the output is the ***probability of being a sample of the positive class***.

Why is it called ***regression*** then? It computes a ***linear combination*** of the features, just like ***linear regression*** does, and ***squishes*** the output using a ***Logistic / Sigmoid*** function.

$$
\hat{p} = \sigma(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(b + w_1x_1 + w_2x_2 + \dots + w_nx_n)}}
$$

![sigmoid](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/320px-Logistic-curve.svg.png)
<center>Source: Wikipedia</center>

Since its output is a ***probability***, we need to ***threshold*** it to get the predicted class. The default threshold is 0.5:

$$
\hat{y} = 
\begin{cases} 0 &\mbox{if } \hat{p} \lt 0.5 \\
1 & \mbox{if } \hat{p} \geq 0.5
\end{cases}
$$
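The two formulas above can be sketched in a few lines of NumPy. The bias `b`, the weights `w` and the samples `X` are made-up values for illustration; in practice they come from fitting the model.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real number into the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical fitted parameters: bias b and one weight per feature
b = -1.0
w = np.array([2.0, -1.0])

# Four made-up samples with two features each
X = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [2.0, 1.0],
    [0.0, 3.0],
])

z = b + X @ w                        # linear part: b + w1*x1 + w2*x2
p_hat = sigmoid(z)                   # predicted probability of the positive class
y_hat = (p_hat >= 0.5).astype(int)   # threshold at 0.5 to get the predicted class

print(z)      # [-1.  1.  2. -4.]
print(y_hat)  # [0 1 1 0]
```

Since the sigmoid of 0 is exactly 0.5, thresholding the probability at 0.5 is the same as checking whether z is negative or not.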

## 1. Definition

There are ***MANY*** metrics available for classification problems. It may be a bit ***confusing*** at first, so let's look at the ***confusion matrix*** to understand it better (pun intended!).

### 1.1 Confusion Matrix

The ***confusion matrix*** is the contingency table of ***actual*** (rows) vs ***predicted*** (columns) ***classes***.

Some representations start with positive samples on both the first row and the first column. But ***Scikit-Learn*** results are returned with ***negative samples first***. So, we're sticking with its convention to avoid confusion!

Therefore, for binary classification, the matrix has 4 values, as shown in the picture:

![](./img/confusion_matrix.png)

The confusion matrix provides the necessary information to build a lot of different metrics.

&nbsp; | &nbsp;
:---:|:---:
![](https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Precisionrecall.svg/264px-Precisionrecall.svg.png) | ![](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/Sensitivity_and_specificity.svg/264px-Sensitivity_and_specificity.svg.png)
<center>Source: Wikipedia</center> | <center>Source: Wikipedia</center>

Notice that the matrix is built on top of ***predicted classes***, not ***probabilities***. It means you should first decide on a ***threshold*** to convert probabilities into classes and only then compute the matrix.

Changing the ***threshold*** will change the matrix and, consequently, the metrics that depend on its values.

So, it is possible to ***tweak the threshold*** to achieve a better performance on a given metric.
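As a rough illustration of how the threshold changes the matrix, the sketch below computes it with Scikit-Learn's `confusion_matrix` at two thresholds. The labels and probabilities are made up, and `matrix_at` is a helper defined only for this example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
p_hat = np.array([0.1, 0.32, 0.45, 0.6, 0.35, 0.55, 0.8, 0.9])


def matrix_at(threshold):
    """Turn probabilities into classes at the given threshold, then build the matrix."""
    y_pred = (p_hat >= threshold).astype(int)
    # Rows are actual classes, columns are predicted classes, negatives first:
    # [[TN, FP],
    #  [FN, TP]]
    return confusion_matrix(y_true, y_pred, labels=[0, 1])


cm_default = matrix_at(0.5)
cm_low = matrix_at(0.3)

print(cm_default)  # [[3 1]
                   #  [1 3]]
print(cm_low)      # [[1 3]
                   #  [0 4]]
```

Lowering the threshold from 0.5 to 0.3 removes the false negative, but it adds two false positives.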

### 1.2 Accuracy

***How often is my classifier right?***

This is the most straightforward metric of all - how often a classifier is right, generally speaking.

It may be a ***misleading*** metric, though, if the dataset is ***imbalanced***.

$$
Accuracy = \frac{TP + TN}{Total}
$$
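A small made-up example shows why accuracy can mislead on imbalanced data: a classifier that always predicts the negative class still scores 95% when only 5% of the samples are positive.

```python
# Made-up imbalanced dataset: 95 negative samples and 5 positive samples
y_true = [0] * 95 + [1] * 5

# A useless classifier that always predicts the negative class
y_pred = [0] * 100

# Count the correct predictions of each kind
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)

print(accuracy)  # 0.95
print(tp)        # 0, not a single positive sample was found
```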

### 1.3 Precision

***My classifier says it's positive - how often is it right?***

If ***False Positives*** are a ***problem***, this is the metric you should pay attention to.

Example: if you want to classify videos as ***appropriate for kids*** (positive) or not (negative), you ***really*** don't want a ***false positive***, that is, an ***inappropriate video*** showing up. You will end up ***rejecting good videos***, but that's a lesser problem.

$$
Precision = \frac{TP}{TP + FP}
$$

### 1.4 True Positive Rate (TPR) / Recall / Sensitivity

***It IS a positive sample - how often does my classifier get it right?***

If ***False Negatives*** are a ***problem***, this is the metric you should pay attention to.

Example: if you want to detect if someone has a ***rare and fatal disease*** (positive) or not (negative), you ***really*** don't want a ***false negative***, that is, ***dismissing a sick person***. You will end up ***further investigating healthy people***, but that's a lesser problem.

$$
Recall = \frac{TP}{TP + FN}
$$
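The trade-off between precision and recall can be seen by computing both from confusion-matrix counts. The counts below are hypothetical and `precision_recall` is a helper written for this sketch.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from counts, returning 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical strict classifier: few false positives, many false negatives
strict = precision_recall(tp=20, fp=2, fn=30)

# Hypothetical lenient classifier: many false positives, few false negatives
lenient = precision_recall(tp=45, fp=40, fn=5)

print(f"strict:  precision={strict[0]:.2f}, recall={strict[1]:.2f}")
print(f"lenient: precision={lenient[0]:.2f}, recall={lenient[1]:.2f}")
```

The strict classifier reaches a precision of about 0.91 with a recall of 0.40, while the lenient one reaches a recall of 0.90 with a precision of about 0.53.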

### 1.5 False Positive Rate (FPR) / 1 - Specificity

***It IS a negative sample - how often does my classifier get it wrong?***

If ***False Positives*** are a ***problem***, this is the metric you should pay attention to.

$$
FPR = 1 - Specificity = 1 - \frac{TN}{TN + FP} = \frac{FP}{TN + FP}
$$
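A quick check of the relationship above, using hypothetical counts for the negative samples:

```python
def fpr_and_specificity(tn, fp):
    """Both metrics only look at the actual negative samples (TN + FP)."""
    specificity = tn / (tn + fp)   # negatives correctly classified
    fpr = fp / (tn + fp)           # negatives wrongly flagged as positive
    return fpr, specificity


# Hypothetical counts: 100 negative samples, 10 of them flagged as positive
fpr, specificity = fpr_and_specificity(tn=90, fp=10)

print(f"FPR={fpr:.2f}, specificity={specificity:.2f}")  # FPR=0.10, specificity=0.90
```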

### 1.6 F1-Score

It is the ***harmonic mean*** of precision and recall, so it combines both metrics into a single value.

It favors classifiers that deliver similar levels of precision and recall.

$$
F_1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}}
$$
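The sketch below computes the F1-score by hand on made-up labels and compares it with Scikit-Learn's `f1_score`. It also prints the arithmetic mean to show that the harmonic mean sits closer to the lower of the two metrics.

```python
from sklearn.metrics import f1_score

# Made-up labels: 4 negative and 6 positive samples
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)   # 3 / 4 = 0.75
recall = tp / (tp + fn)      # 3 / 6 = 0.50

# Harmonic mean of precision and recall
f1_manual = 2 / (1 / precision + 1 / recall)
arithmetic_mean = (precision + recall) / 2

print(f"F1 by hand:      {f1_manual:.3f}")                 # 0.600
print(f"F1 from sklearn: {f1_score(y_true, y_pred):.3f}")  # 0.600
print(f"arithmetic mean: {arithmetic_mean:.3f}")           # 0.625
```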

Exercise:

- Load the breast_cancer data.
- Clean the data!
- Split the data into train and test sets with an 80-20 ratio.
- Set the recurrence-events column as the dependent column.
- Build a model to predict the dependent column.
- Calculate the F1 score of your model.

Find F1 score documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html