# 🤖 Machine Learning
---

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/1200px-Scikit_learn_logo_small.svg.png" width=350 height=500 />

---
- Almost everything can be done with [scikit-learn](https://scikit-learn.org/stable/getting_started.html)<br>
- You can follow the course by [Muller](https://www.cs.columbia.edu/~amueller/comsw4995s19/schedule/) (__scikit-learn core developer__)


In [3]:
# Load the dataset
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)

## 💥 Linear Model


$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2  + \epsilon$$

In [4]:
from sklearn.linear_model import LinearRegression

# Initialize the estimator
lin_reg = LinearRegression()

# Fit the model by ordinary least squares (OLS)
lin_reg.fit(X, y);


# Coefficients
print(lin_reg.coef_)

[-1.08011358e-01  4.64204584e-02  2.05586264e-02  2.68673382e+00
 -1.77666112e+01  3.80986521e+00  6.92224640e-04 -1.47556685e+00
  3.06049479e-01 -1.23345939e-02 -9.52747232e-01  9.31168327e-03
 -5.24758378e-01]
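As a rough illustration of what the OLS fit above computes, the sketch below estimates the coefficients of the linear model with NumPy's least-squares solver. The synthetic data and the names `X_demo`, `y_demo`, `true_beta` and `beta_hat` are assumptions made for this example; it does not use the Boston dataset.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Boston dataset)
rng = np.random.default_rng(0)
n_samples = 200
X_demo = rng.normal(size=(n_samples, 2))
true_beta = np.array([1.5, -2.0, 0.5])  # beta_0, beta_1, beta_2
noise = rng.normal(scale=0.1, size=n_samples)
y_demo = true_beta[0] + X_demo @ true_beta[1:] + noise

# Prepend a column of ones so the intercept beta_0 is estimated as well
X_design = np.column_stack([np.ones(n_samples), X_demo])

# Least-squares solution of X_design @ beta = y_demo
beta_hat, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
print(beta_hat)
```

With low noise, the estimated coefficients should land close to `true_beta`.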


In [6]:
from sklearn.metrics import r2_score, mean_squared_error

# Predictions (on the same data used for fitting)
y_pred = lin_reg.predict(X)

# R^2
r2_score(y, y_pred)

0.7406426641094094

In [8]:
# MSE
mean_squared_error(y, y_pred)

21.894831181729206
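To make the two metrics above concrete, here is a minimal sketch that computes them by hand. The helper names `mse` and `r2` and the toy values are assumptions for this example; in practice `mean_squared_error` and `r2_score` should be preferred.

```python
import numpy as np

def mse(y_true, y_hat):
    """Mean of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean((y_true - y_hat) ** 2)

def r2(y_true, y_hat):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy values (hypothetical)
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_hat_demo = [2.5, 5.5, 7.0, 9.0]

print(mse(y_true_demo, y_hat_demo))  # 0.125
print(r2(y_true_demo, y_hat_demo))   # about 0.975
```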

### ✨ Cross-Validation
----
<img src="https://www.researchgate.net/publication/326465007/figure/fig1/AS:649909518757888@1531961912055/Ten-fold-cross-validation-diagram-The-dataset-was-divided-into-ten-parts-and-nine-of.png" width=550 height=700 />

----

Cross-validation is a method for estimating the **generalization error**: it gives an approximately unbiased measure of how the model would perform on new data.
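The idea can be sketched by hand: split the data into k folds, fit on k-1 of them, and score on the fold left out. The function `k_fold_r2` and the synthetic data are assumptions for this example. Unlike `cross_val_score` with an integer `cv`, this sketch shuffles the rows before splitting.

```python
import numpy as np

def k_fold_r2(X, y, k=5, seed=0):
    """Return the R^2 of an OLS fit on each of k held-out folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit OLS (with intercept) on the k-1 training folds
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score on the fold that was left out
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        residuals = y[val_idx] - A_val @ beta
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((y[val_idx] - y[val_idx].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

# Synthetic regression data (an assumption for illustration)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

fold_scores = k_fold_r2(X_demo, y_demo, k=5)
print(fold_scores.mean(), fold_scores.std())
```

The mean of the fold scores is the estimate of generalization performance, and the standard deviation gives a sense of how much it varies between folds.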


In [12]:
from sklearn.model_selection import train_test_split

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [13]:
from sklearn.model_selection import cross_val_score

# Linear model
lin_reg = LinearRegression()

# Cross-validation (5 folds, R^2 per fold)
cvs = cross_val_score(lin_reg, X_train, y_train, cv=5)
print(cvs)

[0.72915679 0.7589366  0.70550005 0.38299608 0.80419528]


In [14]:
cvs.mean()

0.6761569617154473

In [15]:
cvs.std()

0.15023952960118234

## 👬 K-Nearest Neighbors

Given an observation, the k closest training observations are selected and each one votes for its own class; the majority class becomes the prediction. #Democracy

---
<img src="https://sharad-s.gitbooks.io/cs231n/assets/knn.png" width=550 height=700 />

---
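The voting rule described above can be sketched in a few lines of NumPy. The function `knn_predict` and the toy points are assumptions for this example, and ties are broken in favor of the lowest class label, which may differ from how `KNeighborsClassifier` resolves them.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Count the votes per class and return the most voted one
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Two well-separated toy clusters (hypothetical data)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3])))  # 0
print(knn_predict(X_toy, y_toy, np.array([2.8, 3.0])))  # 1
```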


In [40]:
from sklearn.datasets import load_iris
import numpy as np


X, y = load_iris(return_X_y=True )

X.shape, y.shape, np.unique(y)

((150, 4), (150,), array([0, 1, 2]))

In [41]:
# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y) 

In [38]:
from sklearn.neighbors  import KNeighborsClassifier

# Initialization
knn = KNeighborsClassifier()

# Fit
knn.fit(X_train, y_train)


# Simple test: accuracy on the held-out split
knn.score(X_test, y_test)

0.9736842105263158

### 👨‍🎓 Normalization
---
For KNN it is important to normalize the data; otherwise each feature would carry a different weight in the distance computation, depending only on its scale.


In [26]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

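# Fit (learn the mean and standard deviation from the training set only)
scaler.fit(X_train)

# Transform both splits with the training statistics
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)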
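For reference, standardization amounts to subtracting the training mean and dividing by the training standard deviation, column by column. The sketch below reproduces that arithmetic with NumPy on made-up values; the array names are assumptions for this example.

```python
import numpy as np

# Two features on very different scales (hypothetical values)
X_train_demo = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]])
X_test_demo = np.array([[2.0, 2000.0]])

# Statistics come from the training split only
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same training statistics are applied to both splits
X_train_z = (X_train_demo - mean) / std
X_test_z = (X_test_demo - mean) / std

print(X_train_z)
print(X_test_z)
```

After the transformation both columns of the training data have mean 0 and standard deviation 1, so neither dominates a Euclidean distance.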

## 🎆 Pipelines
---
A pipeline chains transformations in sequence; the last step can be a model.

In [27]:
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier() )
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

0.9473684210526315
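A pipeline is shorthand for fitting each step in order and passing the transformed data along. The sketch below compares a manual scale-then-classify sequence with the equivalent pipeline. The `_demo` names and `random_state=0` are assumptions added so the example is reproducible.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Manual version: fit the scaler, transform, then fit and score the model
scaler_demo = StandardScaler().fit(X_tr)
knn_demo = KNeighborsClassifier().fit(scaler_demo.transform(X_tr), y_tr)
manual_score = knn_demo.score(scaler_demo.transform(X_te), y_te)

# Pipeline version: the same steps behind a single fit/score interface
pipe_demo = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe_score = pipe_demo.fit(X_tr, y_tr).score(X_te, y_te)

print(manual_score == pipe_score)
```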

Cross-validation with pipelines:

In [29]:
# Imports
from sklearn.pipeline import make_pipeline
import numpy as np

# Pipeline
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Cross-validation (the scaler is refit inside each fold)
scores = cross_val_score(knn_pipe, X_train, y_train, cv=10)
np.mean(scores), np.std(scores)

(0.9666666666666666, 0.04082482904638632)

### Grid Search - Hyperparameter Optimization

In [50]:
from sklearn.model_selection import GridSearchCV

# Pipeline
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Parameter grid (keys are <step name>__<parameter name>)
param_grid = {'kneighborsclassifier__n_neighbors':[1,2,3,4,5,6,7,8,9,10],
            'kneighborsclassifier__p':[1,2]}

# Initialization
grid = GridSearchCV(knn_pipe, param_grid, cv=10)

# Fit
grid.fit(X_train, y_train);

# Best parameters found and accuracy on the test set
print(grid.best_params_)
print(grid.score(X_test, y_test))

{'kneighborsclassifier__n_neighbors': 3, 'kneighborsclassifier__p': 2}
0.9210526315789473
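The grid keys above follow the pattern step name, two underscores, parameter name, where `make_pipeline` names each step after its lowercased class. The sketch below prints the step names and the mean cross-validated score of each candidate. The smaller grid, the `_demo` names and `random_state=0` are assumptions for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

pipe_demo = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Step names generated by make_pipeline; these prefix the grid keys
step_names = list(pipe_demo.named_steps)
print(step_names)

grid_demo = GridSearchCV(
    pipe_demo,
    {"kneighborsclassifier__n_neighbors": [1, 3, 5]},
    cv=5,
)
grid_demo.fit(X_tr, y_tr)

# Mean cross-validated accuracy for every candidate in the grid
results = grid_demo.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(params, round(float(mean_score), 3))
```

`best_params_` is the candidate with the highest mean cross-validated score, and by default the search refits that candidate on the whole training set before `score` is called.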




## Ridge Regression
---


In [52]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

X, y = load_boston(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [56]:
from sklearn.linear_model import Ridge


crs = cross_val_score(Ridge(), X_train, y_train, cv=10)

np.mean(crs), np.std(crs)

(0.6939820442382804, 0.1495442418985505)

In [67]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV


# Pipeline
pipe = make_pipeline(StandardScaler(), Ridge())


#Grid
params_Ridge = {'ridge__alpha': [1,0.1,0.01,0.001,0.0001,0] }

Ridge_GS = GridSearchCV(pipe, param_grid=params_Ridge, cv=5)

Ridge_GS.fit(X_train,y_train)
Ridge_GS.best_params_



{'ridge__alpha': 1}

In [62]:
# Note: this is R^2 on the training data; pass X_test, y_test for a held-out estimate
Ridge_GS.score(X_train, y_train)

0.7514006155875652
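For intuition about what `alpha` controls, ridge regression has a closed-form solution in which `alpha` is added to the diagonal of the Gram matrix, shrinking the coefficients toward zero. The sketch below is a minimal NumPy version on synthetic data; `ridge_fit` and the data are assumptions for this example, not the scikit-learn implementation.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients on centered data (intercept not penalized)."""
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) coef = Xc^T yc
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - X_mean @ coef
    return coef, intercept

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# The coefficient norm shrinks as alpha grows; alpha = 0 is plain OLS
norms = []
for alpha in [0.0, 1.0, 100.0]:
    coef, _ = ridge_fit(X_demo, y_demo, alpha)
    norms.append(float(np.linalg.norm(coef)))
    print(alpha, norms[-1])
```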

## Lasso Regression
---

In [70]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV


# Pipeline
pipe = make_pipeline(StandardScaler(), Lasso())


#Grid
params_lasso = {'lasso__alpha': [1,0.1,0.01,0.001,0.0001] }

lasso_G = GridSearchCV(pipe, param_grid=params_lasso, cv=5)

lasso_G.fit(X_train,y_train)
lasso_G.best_params_



{'lasso__alpha': 0.01}

In [71]:
# Note: this is R^2 on the training data; pass X_test, y_test for a held-out estimate
lasso_G.score(X_train, y_train)

0.7513211791189376
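One practical difference from ridge is that the L1 penalty can set coefficients exactly to zero. The sketch below fits both models on synthetic data where only the first two of ten features matter. The data and the `_demo` names are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only features 0 and 1 influence the target (an assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.1).fit(X_demo, y_demo)

# Count the coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))

print("lasso zeros:", n_zero_lasso)
print("ridge zeros:", n_zero_ridge)
```

Lasso should drop most of the irrelevant features, while ridge keeps every coefficient small but nonzero.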

## Logistic Regression
---

In [73]:
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

Simple case

In [88]:
from sklearn.linear_model import LogisticRegression

lg = LogisticRegression(solver="lbfgs", max_iter=10000)

lg.fit(X_train, y_train)

lg.score(X_test, y_test);
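For a binary problem, the fitted model turns a linear score into a probability through the sigmoid function. The sketch below checks this on synthetic data by comparing a manual sigmoid with `predict_proba`. The data and the `_demo` names are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
signal = X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.5, size=200)
y_demo = (signal > 0).astype(int)

clf_demo = LogisticRegression().fit(X_demo, y_demo)

# Linear score z = w . x + b, then the sigmoid 1 / (1 + exp(-z))
z = X_demo @ clf_demo.coef_.ravel() + clf_demo.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
p_sklearn = clf_demo.predict_proba(X_demo)[:, 1]

print(np.allclose(p_manual, p_sklearn))
```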

Full case

In [84]:
from sklearn.linear_model import LogisticRegression


# Pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="lbfgs"))


#Grid
params_logistic = {'logisticregression__C': [2, 1.3, 1,0.1,0.01,0.001,0.0001] }

logistic_G = GridSearchCV(pipe, param_grid=params_logistic , cv=5)

logistic_G.fit(X_train,y_train)
# Note: the score below is accuracy on the training data, not on the test split
logistic_G.best_params_ , logistic_G.score(X_train,y_train)

({'logisticregression__C': 1}, 0.984251968503937)