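# Maximum Likelihood Estimation (MLE)
- Defining the cost function from the viewpoint of maximizing the likelihood
  - Likelihood: roughly, how plausible it is that the observed event occurs under the model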

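- Because y takes only the values 0 or 1, it is a discrete random variable that follows a Bernoulli distribution.
- Likelihood: $p^y(1-p)^{1-y}$
  - The product of the probability that the event occurs, raised to $y$, and the probability that it does not, raised to $1-y$; it equals $p$ when $y=1$ and $1-p$ when $y=0$.
  - Here the probability $p$ is the output of the sigmoid function.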

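- The goal is to find the $\theta$ that maximizes the likelihood.
- For computational convenience, we take the $\log$ of the likelihood, define the result as $L(\theta)$, and find the $\theta$ that maximizes it. Because the log is monotonically increasing, the maximizing $\theta$ is unchanged.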
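A small illustration of why the log is convenient, as a sketch on synthetic values (the probabilities and labels below are made up, not taken from any dataset): the product of many per-sample likelihoods underflows to zero in floating point, while the sum of their logs stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: 2000 predicted probabilities and 0/1 labels
p = rng.uniform(0.05, 0.95, size=2000)
y = rng.integers(0, 2, size=2000)

# Per-sample Bernoulli likelihood p^y * (1 - p)^(1 - y)
per_sample = p ** y * (1 - p) ** (1 - y)

likelihood = np.prod(per_sample)             # product of 2000 numbers below 1
log_likelihood = np.sum(np.log(per_sample))  # same quantity on the log scale

print(likelihood)      # underflows to 0.0 in float64
print(log_likelihood)  # a finite negative number
```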

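- $L(\theta) = \displaystyle\sum_{i=1}^{m}{-\log(1+e^{\theta x^{(i)}})} + \displaystyle\sum_{i=1}^{m}y^{(i)}\theta x^{(i)}$
- This is the cost function $J(\theta)$ up to sign and a factor of $\frac{1}{m}$: $J(\theta) = -\frac{1}{m}L(\theta)$, so maximizing $L(\theta)$ is equivalent to minimizing $J(\theta)$.
- $J(\theta) = -\frac{1}{m}\displaystyle\sum_{i=1}^{m}{[y^{(i)}\theta x^{(i)} - \log(1+e^{\theta x^{(i)}})]}$

- Partial derivative of the MLE objective: $\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m}\displaystyle\sum_{i=1}^{m}{(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}}$
  - This is identical to the partial derivative of the cost function.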
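The gradient formula can be checked numerically. The sketch below uses synthetic data and hypothetical helper names (`sigmoid`, `cost`, `gradient`), with $h_\theta$ taken to be the sigmoid; it compares the analytic gradient with central finite differences of $J(\theta)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y_i * z_i - log(1 + exp(z_i))), with z_i = x_i . theta."""
    z = X @ theta
    # np.logaddexp(0, z) evaluates log(1 + e^z) without overflow
    return -np.mean(y * z - np.logaddexp(0.0, z))

def gradient(theta, X, y):
    """dJ/dtheta_j = (1/m) * sum((h(x_i) - y_i) * x_ij)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                    # synthetic features
y = rng.integers(0, 2, size=50).astype(float)   # synthetic 0/1 labels
theta = rng.normal(size=3)

# Central finite differences, one coordinate direction at a time
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(gradient(theta, X, y), numeric, atol=1e-6))
```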

---

# Logistic Regression with scikit-learn

In [2]:
import pandas as pd

In [9]:
data_url = "./uva.txt"
df = pd.read_csv(data_url, sep='\t')

In [10]:
df.head()

|  | who | Newbie | Age | Gender | Household Income | Sexual Preference | Country | Education Attainment | Major Occupation | Marital Status | Years on Internet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id74364 | 0 | 54.0 | Male | $50-74 | Gay male | Ontario | Some College | Computer | Other | 4-6 yr |
| 1 | id84505 | 0 | 39.0 | Female | Over $100 | Heterosexual | Sweden | Professional | Other | Other | 1-3 yr |
| 2 | id84509 | 1 | 49.0 | Female | $40-49 | Heterosexual | Washington | Some College | Management | Other | Under 6 mo |
| 3 | id87028 | 1 | 22.0 | Female | $40-49 | Heterosexual | Florida | Some College | Computer | Married | 6-12 mo |
| 4 | id76087 | 0 | 20.0 | Male | $30-39 | Bisexual | New Jersey | Some College | Education | Single | 1-3 yr |


In [11]:
df.pop('who')
df.pop('Country')
df.pop('Years on Internet')

df.dtypes

Newbie                    int64
Age                     float64
Gender                   object
Household Income         object
Sexual Preference        object
Education Attainment     object
Major Occupation         object
Marital Status           object
dtype: object

In [13]:
for col in ['Gender','Household Income', 'Sexual Preference', 'Education Attainment', 'Major Occupation', 'Marital Status']:
  df[col] = df[col].astype('category')

df.dtypes


Newbie                     int64
Age                      float64
Gender                  category
Household Income        category
Sexual Preference       category
Education Attainment    category
Major Occupation        category
Marital Status          category
dtype: object

In [14]:
df_modified = pd.get_dummies(df)
df_modified.head()

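|  | Newbie | Age | Gender_Female | Gender_Male | Household Income_$10-19 | Household Income_$20-29 | Household Income_$30-39 | Household Income_$40-49 | Household Income_$50-74 | Household Income_$75-99 | ... | Major Occupation_Education | Major Occupation_Management | Major Occupation_Other | Major Occupation_Professional | Marital Status_Divorced | Marital Status_Married | Marital Status_Other | Marital Status_Separated | Marital Status_Single | Marital Status_Widowed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 54.0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 39.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 49.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 22.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 20.0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |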
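To make the one-hot step concrete, here is a minimal sketch on a made-up three-row frame (`toy` is a hypothetical name). Numeric columns pass through unchanged and each category becomes its own column. Recent pandas versions return boolean dummy columns by default, so `dtype=int` is passed here to get the 0/1 values shown above.

```python
import pandas as pd

# Toy frame (made-up values) with one numeric and one categorical column
toy = pd.DataFrame({
    "Age": [54.0, 39.0, 49.0],
    "Gender": pd.Categorical(["Male", "Female", "Female"]),
})

# dtype=int forces 0/1 integers instead of booleans
encoded = pd.get_dummies(toy, dtype=int)

print(encoded.columns.tolist())  # ['Age', 'Gender_Female', 'Gender_Male']
print(encoded)
```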


In [15]:
df_modified.shape

(19583, 38)

In [16]:
df_modified.isnull().sum()

Newbie                                 0
Age                                  561
Gender_Female                          0
Gender_Male                            0
Household Income_$10-19                0
Household Income_$20-29                0
Household Income_$30-39                0
Household Income_$40-49                0
Household Income_$50-74                0
Household Income_$75-99                0
Household Income_Over $100             0
Household Income_Under $10             0
Sexual Preference_Bisexual             0
Sexual Preference_Gay male             0
Sexual Preference_Heterosexual         0
Sexual Preference_Lesbian              0
Sexual Preference_Transgender          0
Sexual Preference_na                   0
Education Attainment_College           0
Education Attainment_Doctoral          0
Education Attainment_Grammar           0
Education Attainment_High School       0
Education Attainment_Masters           0
Education Attainment_Other             0
Education Attain

- Filling missing values

In [17]:
# Replace missing Age values with the mean of the observed ages
df_modified.loc[pd.isnull(df_modified['Age']), 'Age'] = df_modified['Age'].mean()

In [18]:
df_modified['Age'][pd.isnull(df_modified["Age"])]

Series([], Name: Age, dtype: float64)
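The same mean imputation can also be written with `Series.fillna`. A minimal sketch on a made-up column (`ages` is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Toy column (made-up values) with one missing entry
ages = pd.Series([54.0, np.nan, 49.0, 22.0], name="Age")

# Series.mean() skips NaN by default, so the mean is taken over the observed values
filled = ages.fillna(ages.mean())

print(filled.tolist())
print(filled.isnull().sum())  # 0
```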

- Splitting into features (x) and target (y)

In [23]:
x_data = df_modified.iloc[:, 1:].values
y_data = df_modified.iloc[:, 0].values.reshape(-1, 1)
y_data.shape, x_data.shape

((19583, 1), (19583, 37))

- Min-Max scaling (rescales each feature to the [0, 1] range)

In [24]:
from sklearn import preprocessing

min_max_scaler = preprocessing.MinMaxScaler()
x_data = min_max_scaler.fit_transform(x_data)
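Min-max scaling maps each column to [0, 1] via $(x - x_{min}) / (x_{max} - x_{min})$. Below is a minimal NumPy sketch of that formula on made-up values; with its default `feature_range=(0, 1)`, `MinMaxScaler` applies the same per-column transformation. The sketch assumes no column is constant, which would make the denominator zero.

```python
import numpy as np

# Made-up feature matrix: an age column and a 0/1 dummy column
X = np.array([[54.0, 0.0],
              [39.0, 1.0],
              [20.0, 1.0]])

col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Per-column rescaling to the [0, 1] range
X_scaled = (X - col_min) / (col_max - col_min)

print(X_scaled)
```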

In [25]:
y_data

array([[0],
       [0],
       [1],
       ...,
       [0],
       [1],
       [0]], dtype=int64)

- train-test split

In [26]:
import numpy as np

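# Shuffle the row indices once and slice them so the training and test sets are disjoint
# (np.random.randint draws with replacement, so the same row could land in both sets).
n_samples = y_data.shape[0]
n_train, n_test = int(n_samples * 0.8), int(n_samples * 0.2)
shuffled_idx = np.random.permutation(n_samples)
training_idx, test_idx = shuffled_idx[:n_train], shuffled_idx[n_train:n_train + n_test]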

x_training, x_test = x_data[training_idx, :], x_data[test_idx, :]
y_training, y_test = y_data[training_idx, :], y_data[test_idx, :]

x_training.shape, x_test.shape

((15666, 37), (3916, 37))
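scikit-learn's `train_test_split` is an alternative that shuffles and partitions the rows in a single call, so no row ends up in both sets. A sketch on small made-up arrays; `random_state=0` is an arbitrary choice for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 3 features
x_toy = np.arange(30).reshape(10, 3)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=0
)

print(x_tr.shape, x_te.shape)  # (8, 3) (2, 3)
```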

In [33]:
from sklearn import linear_model

logreg = linear_model.LogisticRegression(fit_intercept = True)
logreg.fit(x_training, y_training.flatten())

LogisticRegression()

In [34]:
logreg.predict(x_test[:3])

array([0, 0, 0], dtype=int64)

In [35]:
logreg.predict_proba(x_test[:3])

array([[0.78392915, 0.21607085],
       [0.92235048, 0.07764952],
       [0.91556906, 0.08443094]])

In [38]:
sum(logreg.predict(x_test) == y_test.flatten()) / len(y_test)

0.7607252298263534
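The expression above counts the matching predictions and divides by the number of test samples, which is the classification accuracy. As a sketch on made-up labels, the same value can be obtained from `sklearn.metrics.accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions: 4 of the 5 predictions are correct
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

manual = np.sum(y_pred == y_true) / len(y_true)
library = accuracy_score(y_true, y_pred)

print(manual, library)  # 0.8 0.8
```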

In [39]:
logreg.decision_function(x_test[:5])

array([-1.28871229, -2.47471998, -2.38361193, -2.51840182, -2.19298309])
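`decision_function` returns the linear score (the weighted sum of the features plus the intercept). For this binary model, passing that score through the sigmoid gives the class-1 probability reported by `predict_proba`. The check below reuses the first three scores and the matching class-1 probabilities from the outputs above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# First three decision_function scores and class-1 probabilities copied from the outputs above
scores = np.array([-1.28871229, -2.47471998, -2.38361193])
proba_class1 = np.array([0.21607085, 0.07764952, 0.08443094])

print(sigmoid(scores))
print(np.allclose(sigmoid(scores), proba_class1, atol=1e-6))
```

A negative score therefore corresponds to a class-1 probability below 0.5, which is why `predict` returned 0 for these rows.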