# Logistic Regression

## Why Not Just Use A Linear Regression?

### Assumptions for Linear Models:
- Gaussian distribution of residuals (errors)
- Y (target variable) is continuous on the prediction interval

![Binary target variable](images/binary.png)
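To make the second assumption concrete, here is a minimal sketch of what happens when ordinary least squares is fit to a 0/1 target. The toy data and the names `x`, `y`, `x_new`, and `y_hat` are made up for illustration and are not part of the review dataset.

```python
import numpy as np

# Hypothetical toy data: one feature and a binary target
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Ordinary least squares fit of y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Predict at points inside and outside the observed range of x
x_new = np.array([-2.0, 0.0, 3.5, 7.0, 10.0])
y_hat = slope * x_new + intercept

print(y_hat)
print("any prediction below 0:", (y_hat < 0).any())
print("any prediction above 1:", (y_hat > 1).any())
```

The fitted line is unbounded, so some predictions fall below 0 or above 1 and cannot be read as probabilities.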

### Finding A Decision Boundary
![Decision boundary](images/lr1.png)

### Log of Equal Odds 
![Log of equal odds](images/lr2.png)

### Logit Link Function
![Logit link function](images/lr3.png)
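As a small numeric check, the sketch below assumes the standard definition of the logit, log(p / (1 - p)), and shows that the sigmoid undoes it. The helper name `logit` and the sample probabilities are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued log-odds to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

def logit(p):
    # Log-odds of a probability p, assuming 0 < p < 1
    return np.log(p / (1 - p))

p = np.array([0.1, 0.5, 0.9])
z = logit(p)

print(z)           # log-odds; equal odds (p = 0.5) map to 0
print(sigmoid(z))  # applying the sigmoid recovers the probabilities
```

Equal odds correspond to a log-odds of zero, which is where a 0.5 probability threshold places the decision boundary.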

### Solving for Each Class (Binary Target)
![Solving for each class](images/lr4.png)
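The two class probabilities follow from inverting the logit. A sketch of the algebra under the standard formulation, with $p = P(y = 1 \mid x)$ and a linear predictor $z = \beta_0 + \beta_1 x$ (these symbols are assumptions and may differ from the notation in the figure):

$$
\begin{aligned}
\log\frac{p}{1-p} &= z \\
\frac{p}{1-p} &= e^{z} \\
p = P(y = 1 \mid x) &= \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} \\
1 - p = P(y = 0 \mid x) &= \frac{1}{1 + e^{z}}
\end{aligned}
$$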

### Log Likelihood
![Log likelihood](images/lr5.png)
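A minimal sketch of the log likelihood, assuming the usual Bernoulli form: the sum over observations of y log(p) + (1 - y) log(1 - p). The function `log_likelihood`, the toy arrays, and the two candidate weights are hypothetical and only illustrate that better-fitting parameters score higher.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_likelihood(X, y, weights, intercept):
    """Bernoulli log likelihood of labels y under a logistic model."""
    p = sigmoid(X @ weights + intercept)  # P(y = 1 | x) for each row
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data: positive feature values go with class 1
X_toy = np.array([[0.5], [1.5], [-1.0], [-2.0]])
y_toy = np.array([1, 1, 0, 0])

ll_zero = log_likelihood(X_toy, y_toy, np.array([0.0]), 0.0)  # every p is 0.5
ll_good = log_likelihood(X_toy, y_toy, np.array([2.0]), 0.0)  # weight matches the pattern

print(ll_zero, ll_good)
```

Fitting a logistic regression amounts to searching for the weights that maximize this quantity (scikit-learn additionally applies a penalty term by default).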

In [None]:
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

In [1]:
import pandas as pd

# Read one review per line; `with` closes each file after reading
with open("poor_amazon_toy_reviews.txt") as f:
    poor = f.readlines()
with open("good_amazon_toy_reviews.txt") as f:
    good = f.readlines()

# Label good reviews 1 and poor reviews 0
good_reviews = [(review, 1) for review in good]
poor_reviews = [(review, 0) for review in poor]

all_reviews = good_reviews + poor_reviews
all_reviews_df = pd.DataFrame(all_reviews, columns=["review", "positive"])
all_reviews_df.head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(
    ngram_range=(1, 1),   # unigrams only
    stop_words="english",
    max_features=1000,    # keep the 1,000 most frequent terms
    token_pattern=r"(?u)\b[a-zA-Z][a-zA-Z]+\b",  # alphabetic tokens of 2+ letters
)
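To see what these settings do, the sketch below applies the same configuration to two made-up reviews. The names `toy_reviews`, `toy_vectorizer`, `toy_counts`, and `vocab` are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus containing digits, punctuation, and stop words
toy_reviews = ["My son loves this toy 100%", "Broke after 2 days"]

toy_vectorizer = CountVectorizer(
    ngram_range=(1, 1),
    stop_words="english",
    max_features=1000,
    token_pattern=r"(?u)\b[a-zA-Z][a-zA-Z]+\b",
)
toy_counts = toy_vectorizer.fit_transform(toy_reviews)

# vocabulary_ maps each kept term to its column index
vocab = sorted(toy_vectorizer.vocabulary_)
print(vocab)
print(toy_counts.toarray())
```

Numbers such as `100` and `2` never match the token pattern, and common words such as `this` are dropped by the English stop-word list, so only alphabetic content words become columns.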

In [134]:
X = vectorizer.fit_transform(all_reviews_df["review"])
y = all_reviews_df["positive"].values
X

<114917x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 926619 stored elements in Compressed Sparse Row format>

In [135]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [139]:
y_pred = lr.predict(X)

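# Accuracy on the training data (not echoed: only the cell's last expression is displayed)
np.mean(y_pred == y)

from sklearn.metrics import confusion_matrix

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
confusion_matrix(y, y_pred)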

array([[  9087,   3613],
       [  1048, 101169]])
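The counts in this matrix are enough to work out accuracy and per-class recall by hand. The sketch below hard-codes the matrix shown above and assumes scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above
cm = np.array([[9087, 3613],
               [1048, 101169]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall_positive = tp / (tp + fn)          # share of good reviews predicted as good
recall_negative = tn / (tn + fp)          # share of poor reviews predicted as poor
majority_baseline = (tp + fn) / cm.sum()  # accuracy of always predicting class 1

print(f"accuracy:          {accuracy:.4f}")
print(f"recall (class 1):  {recall_positive:.4f}")
print(f"recall (class 0):  {recall_negative:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

Because most reviews are positive, overall accuracy looks strong while recall on the poor-review class is noticeably lower, which is why accuracy alone can be misleading on imbalanced data.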

## AUROC (Area Under the Receiver Operating Characteristic Curve)

In [140]:
from sklearn.metrics import roc_auc_score
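# y_pred holds hard 0/1 labels, so this value equals balanced accuracy;
# pass lr.predict_proba(X)[:, 1] instead to score the full ROC curve
roc_auc_score(y, y_pred)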

0.8526295566657286
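This value matches the mean of the two per-class recalls from the confusion matrix, (9087/12700 + 101169/102217) / 2, because the score was computed from hard 0/1 predictions. With hard labels the ROC curve has a single interior point, and its area reduces to balanced accuracy. The sketch below illustrates the difference on synthetic data; `X_demo`, `y_demo`, and `demo_model` are hypothetical stand-ins for the review features and the fitted model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical synthetic data: the label depends on the first feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=500) > 0).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)

hard_labels = demo_model.predict(X_demo)         # 0/1 predictions
scores = demo_model.predict_proba(X_demo)[:, 1]  # estimated P(class 1)

auc_from_labels = roc_auc_score(y_demo, hard_labels)
auc_from_scores = roc_auc_score(y_demo, scores)
balanced_acc = balanced_accuracy_score(y_demo, hard_labels)

print(auc_from_labels, balanced_acc)  # the same number
print(auc_from_scores)                # area under the full ROC curve
```

Passing probabilities lets `roc_auc_score` evaluate every threshold rather than only the default 0.5 cutoff.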

In [143]:
# get_feature_names_out() requires scikit-learn >= 1.0; older releases use get_feature_names()
data = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
data["TARGET"] = y

In [145]:
from sklearn.model_selection import train_test_split

# Default split: 75% of rows for training, 25% for testing
train_df, test_df = train_test_split(data)

# Separate the feature columns from the TARGET column
X_train = train_df.drop(columns=["TARGET"])
X_test = test_df.drop(columns=["TARGET"])

y_train = train_df["TARGET"]
y_test = test_df["TARGET"]

In [147]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(86187, 1000)
(86187,)
(28730, 1000)
(28730,)
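The split above is random and unseeded, so the exact rows change from run to run. One possible variation, sketched on synthetic data, fixes `random_state` for reproducibility and uses `stratify` to keep the class ratio the same in both parts. The arrays and the seed values are hypothetical; the notebook itself does not do this.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: roughly 89% positive, similar to the review data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.89).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.25,    # same proportion as the default
    random_state=42,   # fixed seed so the split is repeatable
    stratify=y_demo,   # preserve the class ratio in both parts
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # positive share in train and test
```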


In [148]:
lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [149]:
y_pred = lr.predict(X_test)

np.mean(y_pred == y_test)

0.9578837452140619

## Cross Validation

In [153]:
from sklearn.model_selection import cross_validate
X = data.drop(columns=["TARGET"])
cv_results = cross_validate(lr, X, y, cv=10, return_train_score=False)

In [154]:
cv_results['test_score']

array([0.9550992 , 0.95475113, 0.95744866, 0.95544727, 0.95475113,
       0.95857988, 0.95466411, 0.95570446, 0.95709686, 0.95561744])
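The ten fold scores are usually summarized by their mean and spread. The sketch below copies the scores printed above into an array; `test_scores` is an illustrative name.

```python
import numpy as np

# Fold accuracies copied from the cross_validate output above
test_scores = np.array([0.9550992, 0.95475113, 0.95744866, 0.95544727, 0.95475113,
                        0.95857988, 0.95466411, 0.95570446, 0.95709686, 0.95561744])

mean_score = test_scores.mean()
std_score = test_scores.std()

print(f"mean accuracy: {mean_score:.4f}")
print(f"std deviation: {std_score:.4f}")
```

A small standard deviation relative to the mean suggests the accuracy estimate does not depend much on which rows land in which fold.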