# Logistic Regression

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
file_path = "../../files/"

## Logistic Regression for `risco_credito` with only two classes

In [10]:
base = pd.read_csv(file_path + 'risco_credito2.csv')
previsores = base.iloc[:,0:4].values
classe = base.iloc[:,4].values

In [11]:
# Before LabelEncoder
previsores

array([['ruim', 'alta', 'nenhuma', '0_15'],
       ['desconhecida', 'alta', 'nenhuma', '15_35'],
       ['desconhecida', 'baixa', 'nenhuma', 'acima_35'],
       ['desconhecida', 'baixa', 'nenhuma', 'acima_35'],
       ['desconhecida', 'baixa', 'adequada', 'acima_35'],
       ['ruim', 'baixa', 'nenhuma', '0_15'],
       ['boa', 'baixa', 'nenhuma', 'acima_35'],
       ['boa', 'alta', 'adequada', 'acima_35'],
       ['boa', 'alta', 'nenhuma', '0_15'],
       ['boa ', 'alta', 'nenhuma', 'acima_35'],
       ['ruim', 'alta', 'nenhuma', '15_35']], dtype=object)

In [12]:
# Preprocessing
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
previsores[:,0] = labelencoder.fit_transform(previsores[:,0])
previsores[:,1] = labelencoder.fit_transform(previsores[:,1])
previsores[:,2] = labelencoder.fit_transform(previsores[:,2])
previsores[:,3] = labelencoder.fit_transform(previsores[:,3])

In [13]:
# LabelEncoder: convert categorical data to numbers
previsores

array([[3, 0, 1, 0],
       [2, 0, 1, 1],
       [2, 1, 1, 2],
       [2, 1, 1, 2],
       [2, 1, 0, 2],
       [3, 1, 1, 0],
       [0, 1, 1, 2],
       [0, 0, 0, 2],
       [0, 0, 1, 0],
       [1, 0, 1, 2],
       [3, 0, 1, 1]], dtype=object)
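
The encoded values above can be reproduced by hand. The sketch below mimics in plain Python what `LabelEncoder` does (it assigns integer codes to the sorted distinct values of a column); `encode_labels` is a hypothetical helper, not part of scikit-learn. It also shows why the first column ends up with four codes: the value `'boa '`, with a trailing space, is treated as a different category from `'boa'`.

```python
def encode_labels(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], classes

# First column of `previsores`, copied from the output above.
historia = ['ruim', 'desconhecida', 'desconhecida', 'desconhecida', 'desconhecida',
            'ruim', 'boa', 'boa', 'boa', 'boa ', 'ruim']

codes, classes = encode_labels(historia)
print(classes)  # ['boa', 'boa ', 'desconhecida', 'ruim']
print(codes)    # [3, 2, 2, 2, 2, 3, 0, 0, 0, 1, 3]

# Assumption: 'boa ' is a typo for 'boa'. Stripping whitespace first merges the two.
clean_codes, clean_classes = encode_labels([v.strip() for v in historia])
print(clean_classes)  # ['boa', 'desconhecida', 'ruim']
```

If the trailing space is indeed unintended, cleaning the column before encoding would leave three categories instead of four.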

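The logistic regression equation is roughly:

````
y = cte + a*f1 + b*f2 + c*f3 + d*f4 + e*f5

Where:
+ f1, f2, ..., f5 are the features
+ a, b, ..., e are the weights
+ cte is the best constant (intercept) found during training
+ y is the log-odds of the positive class; the sigmoid 1 / (1 + exp(-y)) turns it into a probability
````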
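
Written out more precisely, the linear combination is the log-odds, and the predicted probability follows from applying the sigmoid to it. Using the notation of the block above, with $p$ (a symbol introduced here) denoting the probability of the positive class:

$$
y = \text{cte} + a f_1 + b f_2 + c f_3 + d f_4 + e f_5, \qquad
p = \frac{1}{1 + e^{-y}}, \qquad
\ln\frac{p}{1 - p} = y
$$

The positive class is predicted when $p > 0.5$, which is the same condition as $y > 0$.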

Note that this is very similar to a neural network: a single neuron with a sigmoid activation computes the same function.

In [None]:
### LogisticRegression classifier

In [14]:
from sklearn.linear_model import LogisticRegression
classificador = LogisticRegression()
classificador.fit(previsores, classe)
print(classificador.intercept_) # Intercept of the equation: cte
print(classificador.coef_) # Weights, one per feature: a, b, ...

[-0.52358972]
[[-0.65034407  0.25428474 -0.45375558  1.17384764]]


In [15]:
## Classifying the following data
# good credit history, high debt, no collateral, income > 35
# bad credit history, high debt, adequate collateral, income < 15

# PREDICT CLASS
resultado = classificador.predict([[0,0,1,2], [3, 0, 0, 0]])
print(resultado)

# PROBABILITY
resultado2 = classificador.predict_proba([[0,0,1,2], [3, 0, 0, 0]])
print(resultado2)

['baixo' 'alto']
[[0.20256331 0.79743669]
 [0.92234346 0.07765654]]
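
As a check on the equation, the probabilities above can be recomputed by hand from the printed `intercept_` and `coef_`. This is a sketch that uses the rounded values shown in the output, so it agrees with `predict_proba` only to several decimal places. It assumes the second column of `predict_proba` corresponds to `'baixo'` (scikit-learn orders classes alphabetically, which is consistent with the predictions shown).

```python
import math

# Values copied from the printed output of the fitted model above.
intercept = -0.52358972
coefs = [-0.65034407, 0.25428474, -0.45375558, 1.17384764]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def probability_positive(features):
    """P(second class) = sigmoid(intercept + sum of weight * feature)."""
    z = intercept + sum(w * f for w, f in zip(coefs, features))
    return sigmoid(z)

for features in ([0, 0, 1, 2], [3, 0, 0, 0]):
    p = probability_positive(features)
    print(features, round(1 - p, 4), round(p, 4))
# [0, 0, 1, 2] 0.2026 0.7974
# [3, 0, 0, 0] 0.9223 0.0777
```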


## Logistic Regression `credit_data`

In [17]:
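base = pd.read_csv(file_path + 'credit_data.csv')
# Replace invalid negative ages with 40.92 (hard-coded; presumably the mean of the valid ages)
base.loc[base.age < 0, 'age'] = 40.92

# Columns 1..3 are the predictors; column 4 is the class
previsores = base.iloc[:, 1:4].values
classe = base.iloc[:, 4].values

# Fill missing values (NaN) with the column mean, over all three predictor columns
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(previsores[:, 0:3])
previsores[:, 0:3] = imputer.transform(previsores[:, 0:3])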

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
previsores = scaler.fit_transform(previsores)

from sklearn.model_selection import train_test_split
previsores_treinamento, previsores_teste, classe_treinamento, classe_teste = train_test_split(
    previsores, classe, test_size=0.25, random_state=0)
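
`StandardScaler` rescales each column to zero mean and unit variance. Below is a minimal sketch of that computation for a single column in plain Python, using made-up income values (not taken from `credit_data.csv`). It uses the population standard deviation, which is what `StandardScaler` uses; `standardize` is a hypothetical helper.

```python
import statistics

def standardize(column):
    """Return (x - mean) / std for each x, using the population standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std for x in column]

# Hypothetical values for illustration only.
income = [20000.0, 35000.0, 50000.0, 65000.0, 80000.0]
scaled = standardize(income)
print([round(v, 3) for v in scaled])
# [-1.414, -0.707, 0.0, 0.707, 1.414]
```

After scaling, features measured in very different units (such as income and age) contribute on a comparable scale to the linear combination.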

In [18]:
from sklearn.linear_model import LogisticRegression
classificador = LogisticRegression(random_state = 1, solver='lbfgs')
classificador.fit(previsores_treinamento, classe_treinamento)
previsoes = classificador.predict(previsores_teste)

In [26]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
precisao = accuracy_score(classe_teste, previsoes)  # accuracy: fraction of correct predictions
matriz = confusion_matrix(classe_teste, previsoes)  # rows: true class, columns: predicted class
print("Accuracy", precisao)
print("Confusion matrix\n", matriz, "\n")
print("Confusion matrix (fraction of total)\n", matriz/matriz.sum(), "\n")
print(classification_report(classe_teste, previsoes))

Accuracy 0.946
Confusion matrix
 [[423  13]
 [ 14  50]] 

Confusion matrix (fraction of total)
 [[0.846 0.026]
 [0.028 0.1  ]] 

              precision    recall  f1-score   support

           0       0.97      0.97      0.97       436
           1       0.79      0.78      0.79        64

    accuracy                           0.95       500
   macro avg       0.88      0.88      0.88       500
weighted avg       0.95      0.95      0.95       500
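
The numbers in the report for class 1 can be derived directly from the confusion matrix. A short sketch, using the matrix printed above (rows are true classes, columns are predicted classes):

```python
# Confusion matrix copied from the output above.
matrix = [[423, 13],
          [14, 50]]

tn, fp = matrix[0]  # true class 0: predicted 0, predicted 1
fn, tp = matrix[1]  # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)  # of the predicted 1s, how many are truly 1
recall_1 = tp / (tp + fn)     # of the true 1s, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 3), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
# 0.946 0.79 0.78 0.79
```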



## Logistic Regression `census`

In [27]:
base = pd.read_csv(file_path + 'census.csv')

previsores = base.iloc[:, 0:14].values
classe = base.iloc[:, 14].values
                
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

column_tranformer = ColumnTransformer(
    [('one_hot_encoder', OneHotEncoder(), [1, 3, 5, 6, 7, 8, 9, 13])],
    remainder='passthrough')
previsores = column_tranformer.fit_transform(previsores).toarray()

labelencorder_classe = LabelEncoder()
classe = labelencorder_classe.fit_transform(classe)


from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
previsores = scaler.fit_transform(previsores)

from sklearn.model_selection import train_test_split
previsores_treinamento, previsores_teste, classe_treinamento, classe_teste = train_test_split(
    previsores, classe, test_size=0.15, random_state=0)

In [28]:
from sklearn.linear_model import LogisticRegression
classificador = LogisticRegression(solver='lbfgs')
classificador.fit(previsores_treinamento, classe_treinamento)
previsoes = classificador.predict(previsores_teste)

In [30]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
precisao = accuracy_score(classe_teste, previsoes)  # accuracy: fraction of correct predictions
matriz = confusion_matrix(classe_teste, previsoes)  # rows: true class, columns: predicted class
print("Accuracy", precisao, "\n")
print("Confusion matrix\n", matriz, "\n")
print("Confusion matrix (fraction of total)\n", matriz/matriz.sum(), "\n")
print(classification_report(classe_teste, previsoes))

Accuracy 0.849539406345957 

Confusion matrix
 [[3423  270]
 [ 465  727]] 

Confusion matrix (fraction of total)
 [[0.70071648 0.05527124]
 [0.09518936 0.14882293]] 

              precision    recall  f1-score   support

           0       0.88      0.93      0.90      3693
           1       0.73      0.61      0.66      1192

    accuracy                           0.85      4885
   macro avg       0.80      0.77      0.78      4885
weighted avg       0.84      0.85      0.84      4885
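
Accuracy is easier to judge against a trivial baseline. The sketch below uses only the counts printed above and compares the model with a hypothetical classifier that always predicts the majority class 0. The baseline is an illustration added here, not something the notebook computes.

```python
# Counts copied from the output above.
support = {0: 3693, 1: 1192}
matrix = [[3423, 270],
          [465, 727]]

total = sum(support.values())
baseline_accuracy = max(support.values()) / total       # always predict the majority class
model_accuracy = (matrix[0][0] + matrix[1][1]) / total  # correct predictions lie on the diagonal

print(round(baseline_accuracy, 4), round(model_accuracy, 4))
# 0.756 0.8495
```

The model improves on the baseline by roughly nine percentage points, while the recall of 0.61 for class 1 shows that many positive cases are still missed.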



# Questions and Answers from ML-AZ

#### Logistic Regression Intuition

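Is Logistic Regression a linear or nonlinear model?

It is a linear model: the log odds are a linear function of the features, so the decision boundary is a
straight line (a hyperplane when there are more than two features). You will see this at the end of the
section, where the classifier’s separator is plotted as a straight line.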
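
To see why the separator is straight: the predicted probability is 0.5 exactly where the linear combination of the features is zero, and that equation describes a line. A small sketch with two features and made-up weights (`w1`, `w2`, and `b` are hypothetical, not taken from a fitted model):

```python
import math

# Hypothetical weights and intercept for two features (illustration only).
w1, w2, b = 2.0, -1.0, 0.5

def probability(x1, x2):
    return 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + b)))

def boundary_x2(x1):
    """Solve w1*x1 + w2*x2 + b = 0 for x2: the equation of a straight line in x1."""
    return -(w1 * x1 + b) / w2

for x1 in (-1.0, 0.0, 2.0):
    x2 = boundary_x2(x1)
    print(x1, x2, probability(x1, x2))  # the probability is 0.5 at every point on the line
```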

What are the Logistic Regression assumptions?

First, binary logistic regression requires the dependent variable to be binary and ordinal logistic regression
requires the dependent variable to be ordinal.

Second, logistic regression requires the observations to be independent of each other. In other words, the
observations should not come from repeated measurements or matched data.

Third, logistic regression requires there to be little or no multicollinearity among the independent variables.
This means that the independent variables should not be too highly correlated with each other.

Fourth, logistic regression assumes linearity of independent variables and log odds. Although this analysis
does not require the dependent and independent variables to be related linearly, it requires that the
independent variables are linearly related to the log odds.
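
A rough way to screen for the multicollinearity mentioned in the third assumption is to look at pairwise correlations between features. The sketch below uses a hand-written Pearson correlation on made-up columns; the data, the helper `pearson`, and the 0.9 threshold are assumptions for illustration, and pairwise correlation will not catch every form of multicollinearity.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical features: salary in two currencies carries almost the same information.
age = [25, 32, 47, 51, 38, 29]
salary_usd = [30000, 42000, 80000, 61000, 55000, 38000]
salary_eur = [27500, 38700, 73500, 56200, 50400, 35100]

print(round(pearson(salary_usd, salary_eur), 3))  # close to 1: a highly collinear pair
print(round(pearson(age, salary_usd), 3))

# Arbitrary screening threshold: flag the pair if the correlation is very strong.
is_collinear = abs(pearson(salary_usd, salary_eur)) > 0.9
print(is_collinear)
```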

Can Logistic Regression be used for many independent variables as well?

Yes, Logistic Regression can be used with as many independent variables as you want. However, be aware that
you won’t be able to visualize the results in more than 3 dimensions.

#### Logistic Regression in Python

What does the fit method do here?

The fit method trains the Logistic Regression model on the training data: it computes the weights
(coefficients) of the model (see the Intuition Lecture) for that particular training set, composed of
X_train and y_train. Once the weights are computed, the model is fully trained on your training data and
ready to predict new outcomes with the predict method.
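
To make "computing the weights" concrete, here is a minimal sketch that fits a one-feature logistic regression by plain gradient descent on the log-loss. This is not what scikit-learn runs internally (its `LogisticRegression` uses solvers such as `lbfgs` and applies L2 regularization by default); the data, learning rate, and step count are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, learning_rate=0.1, steps=2000):
    """Fit weight w and intercept b by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def predict(xs, w, b):
    return [1 if sigmoid(w * x + b) > 0.5 else 0 for x in xs]

# Hypothetical, already-centered feature with a binary label.
X_train = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
y_train = [0, 0, 0, 1, 1, 1]

w, b = fit(X_train, y_train)
print(predict(X_train, w, b))
# [0, 0, 0, 1, 1, 1]
```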

We predicted the outcomes of a set of observations (the test set). How do we do the same for
a single observation, to predict a single outcome?

Let’s say this observation has the following features: Age = 30, Estimated Salary = 50000.
Then the code to get the predicted outcome would be the following (notice how we must not forget to scale
that single observation first):

`y_pred = classifier.predict(sc_X.transform(np.array([[30, 50000]])))`
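
Below is a self-contained sketch of this pattern for Age = 30 and Estimated Salary = 50000, with a tiny made-up training set standing in for the course data (the values of `X_train` and `y_train` are assumptions). The point is the shape: `transform` and `predict` expect a 2D array, so a single observation is passed as one row, and it is scaled with the scaler that was fitted on the training data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: columns are [Age, Estimated Salary].
X_train = np.array([[22, 20000], [25, 32000], [30, 50000],
                    [35, 58000], [45, 90000], [52, 110000]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])

# Fit the scaler on the training data only, then train on the scaled features.
sc_X = StandardScaler()
X_train_scaled = sc_X.fit_transform(X_train)
classifier = LogisticRegression()
classifier.fit(X_train_scaled, y_train)

# One observation = one row of a 2D array, scaled with the already-fitted scaler.
single = np.array([[30, 50000]], dtype=float)
y_pred = classifier.predict(sc_X.transform(single))
proba = classifier.predict_proba(sc_X.transform(single))
print(y_pred.shape, proba.shape)  # (1,) (1, 2)
```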

Is the Confusion Matrix the optimal way to evaluate the performance of the model?

No, it only gives you an idea of how well your model can perform. If you get a good confusion matrix with
few prediction errors on the test set, then there is a chance your model has good predictive power. However,
a more reliable way to evaluate your model is K-Fold Cross Validation, which you will see in Part 10. It
consists of evaluating your model on several test sets (called the validation sets), so that we can make
sure we didn’t just get lucky on one single test set. K-Fold Cross Validation is now a standard way for
Data Scientists and AI Developers to evaluate their models. However, the technique is a separate subject,
so I preferred to leave it for Part 10, after we cover all the different models.
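
As a preview of the idea, the sketch below shows only the splitting step of K-Fold Cross Validation in plain Python: every sample lands in exactly one validation fold. `k_fold_indices` is a hypothetical helper written for this example; in practice you would train and score the model on each split and average the scores, and scikit-learn provides `KFold` and `cross_val_score` for that.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k consecutive folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

for train, validation in k_fold_indices(n_samples=10, k=3):
    print(len(train), validation)
# 6 [0, 1, 2, 3]
# 7 [4, 5, 6]
# 7 [7, 8, 9]
```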