
# Logistic Regression Practice
**Possums**

<img src="./images/pos2.jpg" style="height: 250px">

*The common brushtail possum (Trichosurus vulpecula, from the Greek for "furry tailed" and the Latin for "little fox", previously in the genus Phalangista) is a nocturnal, semi-arboreal marsupial of the family Phalangeridae, native to Australia, and the second-largest of the possums.* -[Wikipedia](https://en.wikipedia.org/wiki/Common_brushtail_possum)

In [1]:
# imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

# Import train_test_split.
from sklearn.model_selection import train_test_split

# Import logistic regression
from sklearn.linear_model import LogisticRegression


### Get the data

Read in the `possum.csv` data (located in the `data` folder).

In [8]:
possums = pd.read_csv('data/possum.csv')
possums.head()

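|   | site | pop | sex | age | head_l | skull_w | total_l | tail_l |
|---|------|-----|-----|-----|--------|---------|---------|--------|
| 0 | 1 | Vic | m | 8.0 | 94.1 | 60.4 | 89.0 | 36.0 |
| 1 | 1 | Vic | f | 6.0 | 92.5 | 57.6 | 91.5 | 36.5 |
| 2 | 1 | Vic | f | 6.0 | 94.0 | 60.0 | 95.5 | 39.0 |
| 3 | 1 | Vic | f | 6.0 | 93.2 | 57.1 | 92.0 | 38.0 |
| 4 | 1 | Vic | f | 2.0 | 91.5 | 56.3 | 85.5 | 36.0 |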


### Preprocessing

> Check for & deal with any missing values.  
Convert categorical columns to numeric.  
Do any other preprocessing you feel is necessary.

In [9]:
possums.isnull().sum()

site       0
pop        0
sex        0
age        2
head_l     0
skull_w    0
total_l    0
tail_l     0
dtype: int64

In [10]:
possums.dropna(inplace=True)

In [11]:
possums['pop'].value_counts()

pop
other    58
Vic      44
Name: count, dtype: int64

In [12]:
possums['pop'] = possums['pop'].map({'other':0, 'Vic':1})
possums['sex'] = possums['sex'].map({'m':0, 'f':1})
possums.head()

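|   | site | pop | sex | age | head_l | skull_w | total_l | tail_l |
|---|------|-----|-----|-----|--------|---------|---------|--------|
| 0 | 1 | 1 | 0 | 8.0 | 94.1 | 60.4 | 89.0 | 36.0 |
| 1 | 1 | 1 | 1 | 6.0 | 92.5 | 57.6 | 91.5 | 36.5 |
| 2 | 1 | 1 | 1 | 6.0 | 94.0 | 60.0 | 95.5 | 39.0 |
| 3 | 1 | 1 | 1 | 6.0 | 93.2 | 57.1 | 92.0 | 38.0 |
| 4 | 1 | 1 | 1 | 2.0 | 91.5 | 56.3 | 85.5 | 36.0 |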
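
`Series.map` turns any value that is missing from the dictionary into `NaN` without raising an error, so it can be worth confirming that the mapping covered every category. A minimal sketch, assuming a small hand-built frame (`demo` is hypothetical and only mimics the two categorical columns):

```python
import pandas as pd

# Hypothetical stand-in for the possum data; only the two categorical columns.
demo = pd.DataFrame({
    'pop': ['Vic', 'Vic', 'other'],
    'sex': ['m', 'f', 'f'],
})

pop_map = {'other': 0, 'Vic': 1}
sex_map = {'m': 0, 'f': 1}

# Categories present in the data but absent from the mapping dictionaries.
unmapped_pop = set(demo['pop']) - set(pop_map)
unmapped_sex = set(demo['sex']) - set(sex_map)

demo['pop'] = demo['pop'].map(pop_map)
demo['sex'] = demo['sex'].map(sex_map)

# Any NaN here would indicate a value the dictionary did not cover.
n_missing = int(demo[['pop', 'sex']].isnull().sum().sum())
print(unmapped_pop, unmapped_sex, n_missing)
```

The same check could be run on `possums` right after the mapping cell.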


### Modeling

> Build a logistic regression model to predict `pop`, the region of origin.  
Examine the performance of the model.

In [19]:
X = possums.drop(columns='pop')
y = possums['pop']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [22]:
logreg = LogisticRegression(solver='newton-cg')

logreg.fit(X_train, y_train)

In [23]:
logreg.score(X_train, y_train)

1.0

In [24]:
logreg.score(X_test, y_test)

0.9615384615384616

### Interpretation & Predictions

> Interpret at least one coefficient from your model.  
> Generate predicted probabilities for your testing set.  
> Generate predictions for your testing set.

In [27]:
pd.Series(logreg.coef_[0], index=X.columns)

site      -2.080073
sex       -0.122902
age        0.085735
head_l    -0.155197
skull_w   -0.159106
total_l    0.157245
tail_l    -0.660194
dtype: float64

In [28]:
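# Exponentiate the age coefficient to convert it from log-odds to an odds ratio.
# Holding the other features constant, a one-unit increase in age multiplies
# the odds that a possum is from Vic (pop == 1) by about 1.09.
np.exp(0.085735)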

1.0895175678905726
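
The same conversion can be applied to every coefficient at once. A minimal sketch, with the coefficient values copied from the output above into a hypothetical `coefs` Series (in the notebook, `pd.Series(logreg.coef_[0], index=X.columns)` would play that role):

```python
import numpy as np
import pandas as pd

# Coefficients copied from the fitted model's output above (log-odds scale).
coefs = pd.Series({
    'site': -2.080073, 'sex': -0.122902, 'age': 0.085735,
    'head_l': -0.155197, 'skull_w': -0.159106,
    'total_l': 0.157245, 'tail_l': -0.660194,
})

# Exponentiating converts each log-odds coefficient into an odds ratio.
odds_ratios = np.exp(coefs)

# Percent change in the odds of pop == 1 (Vic) per one-unit increase.
pct_change = (odds_ratios - 1) * 100

summary = pd.DataFrame({'odds_ratio': odds_ratios, 'pct_change': pct_change})
print(summary.round(3))
```

An odds ratio below 1 means the odds of `pop == 1` fall as the feature increases, with the other features held constant. For `tail_l`, a one-unit increase roughly halves the odds.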

In [29]:
logreg.predict(X_test)

array([1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       1, 0, 1, 0])

In [30]:
logreg.predict_proba(X_test)

array([[6.22037876e-03, 9.93779621e-01],
       [9.92282343e-01, 7.71765721e-03],
       [9.94511083e-01, 5.48891701e-03],
       [6.60162480e-01, 3.39837520e-01],
       [1.19995846e-02, 9.88000415e-01],
       [1.16985949e-01, 8.83014051e-01],
       [9.99901171e-01, 9.88290609e-05],
       [6.31897984e-01, 3.68102016e-01],
       [4.63700670e-02, 9.53629933e-01],
       [1.30034213e-02, 9.86996579e-01],
       [3.70493598e-03, 9.96295064e-01],
       [3.81657509e-02, 9.61834249e-01],
       [9.99856149e-01, 1.43850574e-04],
       [9.99905263e-01, 9.47368738e-05],
       [9.95113010e-01, 4.88699022e-03],
       [1.48085962e-02, 9.85191404e-01],
       [9.96639147e-01, 3.36085331e-03],
       [1.04752376e-01, 8.95247624e-01],
       [1.67045723e-02, 9.83295428e-01],
       [5.37403856e-03, 9.94625961e-01],
       [9.99818713e-01, 1.81286655e-04],
       [9.81941837e-01, 1.80581625e-02],
       [8.19436808e-03, 9.91805632e-01],
       [9.97523891e-01, 2.47610870e-03],
       [4.560178
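
For a binary model, `predict` corresponds to choosing the class with the larger predicted probability, which is the same as a 0.5 cutoff on the second column. A minimal sketch using the first four rows of the output above (the array name `proba` is an assumption for illustration):

```python
import numpy as np

# First four rows of the predict_proba output above:
# column 0 is P(pop == 0, 'other'), column 1 is P(pop == 1, 'Vic').
proba = np.array([
    [6.22037876e-03, 9.93779621e-01],
    [9.92282343e-01, 7.71765721e-03],
    [9.94511083e-01, 5.48891701e-03],
    [6.60162480e-01, 3.39837520e-01],
])

# Each row sums to 1, since the two classes are exhaustive.
row_sums = proba.sum(axis=1)

# Picking the more probable class is equivalent to a 0.5 cutoff on column 1.
pred_argmax = proba.argmax(axis=1)
pred_threshold = (proba[:, 1] >= 0.5).astype(int)

print(pred_argmax, pred_threshold)
```

The resulting classes agree with the first four entries of the `logreg.predict(X_test)` output. The fourth row is the least confident of these, with about a 0.34 probability of Vic.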

In [31]:
possums['site'].value_counts()

site
1    33
7    18
5    13
6    13
2    11
3     7
4     7
Name: count, dtype: int64

In [38]:
pd.get_dummies(possums.site).astype(int)

Unnamed: 0,1,2,3,4,5,6,7
0,1,0,0,0,0,0,0
1,1,0,0,0,0,0,0
2,1,0,0,0,0,0,0
3,1,0,0,0,0,0,0
4,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...
99,0,0,0,0,0,0,1
100,0,0,0,0,0,0,1
101,0,0,0,0,0,0,1
102,0,0,0,0,0,0,1
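
Since `site` is a label rather than a quantity, one option is to replace it with indicator columns inside the feature frame instead of producing them separately. A minimal sketch, assuming a hypothetical mini-frame `demo` whose values are illustrative only:

```python
import pandas as pd

# Hypothetical mini-frame; the site labels and tail lengths are illustrative only.
demo = pd.DataFrame({
    'site': [1, 1, 2, 5, 7],
    'tail_l': [36.0, 36.5, 39.0, 38.0, 36.0],
})

# One indicator column per site; drop_first omits the first level,
# which then acts as the reference category.
encoded = pd.get_dummies(demo, columns=['site'], drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

Passing `columns=['site']` keeps the remaining features in place and prefixes the new columns with `site_`.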


In [32]:
y_test

30     1
69     0
64     0
49     0
42     1
40     1
92     0
47     0
10     1
0      1
18     1
31     1
99     0
87     0
78     0
4      1
81     0
33     1
12     1
26     1
102    0
55     0
22     1
70     0
46     0
100    0
Name: pop, dtype: int64
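
To see where the model goes wrong, the predictions can be tabulated against the actual labels. A minimal sketch, assuming the `pop` Series printed above is `y_test` (its values are copied in order into the hypothetical list `y_true`):

```python
import pandas as pd

# Predictions copied from the logreg.predict(X_test) output above.
y_pred = [1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
          1, 0, 1, 0]

# Assumed to be y_test: the values of the `pop` Series printed above, in order.
y_true = [1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
          1, 0, 0, 0]

results = pd.DataFrame({'actual': y_true, 'predicted': y_pred})

# Rows are actual classes, columns are predicted classes.
confusion = pd.crosstab(results['actual'], results['predicted'])

# Share of rows where the prediction matches the actual label.
accuracy = (results['actual'] == results['predicted']).mean()
print(confusion)
print(accuracy)
```

Under that assumption there is a single disagreement, a possum from `other` predicted as Vic, which is consistent with the test accuracy of 25/26 ≈ 0.9615 reported earlier.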