**과적합을 방지하기 위해 test data는 데이터를 분석할 때, 학습할 때 절대 영향을 끼쳐서는 안된다는 것을 알지 못했을 때 작성했다.
<br>이후 수정했지만, 중간에 total_df(test data+train data)를 사용해 데이터를 분석하거나 total_df가 학습에 영향을 끼치는 부분이 있다면 train_df으로 바꿔주어야 한다.**

**전처리는 total_df를 사용해 한번에 전처리했다.**

In [4]:
import pandas as pd
import numpy as np
import os

In [6]:
# titanic 디렉토리 내 파일들 가져오기
files = [file_name for file_name in os.listdir("./titanic")]
files

['gender_submission.csv', 'test.csv', 'train.csv']

In [8]:
# files의 csv들을 dataframe으로 읽어오기
df_list = [pd.read_csv(os.path.join("titanic", file_name)) for file_name in files]
df_list

[     PassengerId  Survived
 0            892         0
 1            893         1
 2            894         0
 3            895         0
 4            896         1
 ..           ...       ...
 413         1305         0
 414         1306         1
 415         1307         0
 416         1308         0
 417         1309         0
 
 [418 rows x 2 columns],
      PassengerId  Pclass                                          Name  \
 0            892       3                              Kelly, Mr. James   
 1            893       3              Wilkes, Mrs. James (Ellen Needs)   
 2            894       2                     Myles, Mr. Thomas Francis   
 3            895       3                              Wirz, Mr. Albert   
 4            896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)   
 ..           ...     ...                                           ...   
 413         1305       3                            Spector, Mr. Woolf   
 414         1306       1            

In [9]:
# index 새로해서, train_data와 test_data 합치기
total_df = pd.concat([df_list[2], df_list[1]]).reset_index()
total_df

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,413,1305,,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
1305,414,1306,,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
1306,415,1307,,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
1307,416,1308,,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


In [10]:
train_index = list(range(len(df_list[1])))
test_index = list(range(len(df_list[1]), len(total_df)))

## data 분석

### 데이터의 종류(양적 데이터, 질적 데이터)
#### 수치형 데이터
수학 연산을 할 수 있는 수치 값, 데이터의 속성을 그대로 나타내고 있음
- sum, max 의미 있음
- 연속형
    - ex. 키, 몸무게, 시간
- 이산형
    - ex. 사과의 개수, 책의 페이지 수

#### 범주형 데이터
범주로 나누어지는 값, 숫자로 나타낼 수는 있으나 아무 의미 없음
- sum, max 의미 없음
- 순서형
    - ex. 순위, 등급
- 명목형
    - ex. 성별(남/여), 우편 번호(서울/부산/제주)

In [138]:
train_df.head(1)

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S


#### 수치형 데이터
수학 연산을 할 수 있는 수치 값
- sum, max 의미 있음
- 연속형
    - ex. 
- 이산형
    - ex. Age, SibSp(타이타닉 호에 탑승한 형제/자매/배우자 수의 총 합), Parch(타이타닉 호에 탐승한 부모/자녀 수의 총 합), Fare(탑승요금)

#### 범주형 데이터(Qualitative, Categorical)
범주로 나누어지는 값
- sum, max 의미 없음
- 순서형
    - Pclass, Passengerld, Ticket(티켓번호), Cabin(객실번호)
- 명목형
    - Name, Sex, Embarked(승선항), Survived(생존여부)

#### y의 값(Survived)과 다른 변수들이 어떤 상관관계가 있는지 확인

In [11]:
train_df = df_list[2]

In [13]:
# 생존자들의 성별, Pclass, 승선지역 별 비율
columns = ["Sex", "Pclass", "Embarked"]
survived = train_df[train_df["Survived"]==1][columns].value_counts()/len(train_df[train_df['Survived']==1]) * 100

survived.unstack()

Unnamed: 0_level_0,Embarked,C,Q,S
Sex,Pclass,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,1,12.280702,0.292398,13.450292
female,2,2.046784,0.584795,17.836257
female,3,4.385965,7.017544,9.649123
male,1,4.97076,,8.187135
male,2,0.584795,,4.385965
male,3,2.923977,0.877193,9.94152


In [14]:
survived = train_df[train_df['Survived']==1]["Sex"].value_counts() / len(train_df[train_df['Survived']==1]) * 100
survived

female    68.128655
male      31.871345
Name: Sex, dtype: float64

In [16]:
survived = train_df[train_df['Survived']==1]["Pclass"].value_counts() / len(train_df[train_df['Survived']==1]) * 100
survived

1    39.766082
3    34.795322
2    25.438596
Name: Pclass, dtype: float64

In [17]:
# 사망자들의 성별, Pclass, 승선지역 별 비율
dead = train_df[train_df['Survived']==0][columns].value_counts() / len(train_df[train_df['Survived']==1]) * 100
dead.unstack()

Unnamed: 0_level_0,Embarked,C,Q,S
Sex,Pclass,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,1,0.292398,,0.584795
female,2,,,1.754386
female,3,2.339181,2.631579,16.081871
male,1,7.309942,0.292398,14.912281
male,2,2.339181,0.292398,23.976608
male,3,9.649123,10.526316,67.54386


### 두 변수들의 상관관계 알아보기
#### 공분산(Covariance)
- 두 변수 x의 편차와 y의 편차를 곱한 것의 평균
- 단위의 크기에 영향을 받는다.

#### 상관계수(Corrleation)
- 변수의 절대적 크기에 영향을 받지 않도록 단위화 시킨 것.
- 기본적으로 피어슨 상관계수가 이용된다.
    - r = x와 y가 함께 변하는 정도 / x와 y가 각각 변하는 정도    
        - 0.7 ~ 1.0 : 강한 양적 상관관계
        - 0.3 ~ 0.8 : 뚜렷한 양적 상관관계
        - 0.1 ~ 0.3 : 약한 양적 상관관계
        - -0.1 ~ 0.1 : 상관관계 거의 없음
        - -0.3 ~ -0.1 : 약한 음적 상관관계
        - -0.7 ~ -0.3 : 뚜렷한 음적 상관관계
        - -1.0 ~ -0.7 : 강한 음적 상관관계
    - https://gomguard.tistory.com/173

In [20]:
# 두 변수들의 상관관계 알아보기
train_df.corr()

Unnamed: 0,index,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_code
index,1.0,0.170654,-0.005007,-0.018212,0.012723,-0.027343,0.003911,-0.003723,-0.038626
PassengerId,0.170654,1.0,-0.005007,-0.038354,0.028814,-0.055224,0.008942,0.031428,-0.013406
Survived,-0.005007,-0.005007,1.0,-0.338481,-0.077221,-0.035322,0.081629,0.257307,0.543351
Pclass,-0.018212,-0.038354,-0.338481,1.0,-0.408106,0.060832,0.018322,-0.558629,-0.124617
Age,0.012723,0.028814,-0.077221,-0.408106,1.0,-0.243699,-0.150917,0.17874,-0.063645
SibSp,-0.027343,-0.055224,-0.035322,0.060832,-0.243699,1.0,0.373587,0.160238,0.109609
Parch,0.003911,0.008942,0.081629,0.018322,-0.150917,0.373587,1.0,0.221539,0.213125
Fare,-0.003723,0.031428,0.257307,-0.558629,0.17874,0.160238,0.221539,1.0,0.185523
Sex_code,-0.038626,-0.013406,0.543351,-0.124617,-0.063645,0.109609,0.213125,0.185523,1.0


In [19]:
# Sex_code 생성 male->0, female->1 
train_df["Sex_code"] = train_df.Sex.map({"male":0, "female":1})

### 범주형 - 범주형

#### Phi correlation
- 비교대상 범주 대상이 2개

#### Cramer's V
- 비교대상 범주 3개 이상

### 범주형 - 연속형

#### Point biserial correlation
**↑ Pclass(명명척도), Fare(연속형 변수)**
- 두 개 변수 중 하나는 범주형 변수이고, 다른 하나는 연속형 변수일 때
- ex. 성별과 수학점수와의 관계

#### Biserial correlation
- 두 개 변수 중 하나는 명명척도이고 하나는 연속변수일 때
- 명명척도의 유목은 인위적 구분하는 이분변수
- ex. 우열반 편성여부와 중간고사 점수와의 상관관계

#### Polyserial correlation
- 두 개 변수 중 하나는 명명척도이고 하나는 연속변수일 때
- 명명척도의 유목은 비인위적이며 3개 이상의 유목
- ex. 인종과 키와의 상관관계

https://dodonam.tistory.com/217

In [21]:
from scipy import stats
stats.pointbiserialr(train_df.loc[train_df["Fare"].notnull(), "Pclass"], train_df.loc[train_df["Fare"].notnull(), "Fare"])

PointbiserialrResult(correlation=-0.5586287323271726, pvalue=3.2662678942758068e-108)

In [22]:
# Pclass별 Fare(승선료)의 평균을 구해라.
train_df[["Pclass", "Fare"]].groupby(["Pclass"]).mean()

Unnamed: 0_level_0,Fare
Pclass,Unnamed: 1_level_1
1,87.508992
2,21.179196
3,13.302889


In [26]:
# 수치형 데이터의 통계치 확인
numeric_list = ["Age", "SibSp", "Parch", "Fare"]
train_df[numeric_list].describe()

Unnamed: 0,Age,SibSp,Parch,Fare
count,1046.0,1309.0,1309.0,1308.0
mean,29.881138,0.498854,0.385027,33.295479
std,14.413493,1.041658,0.86556,51.758668
min,0.17,0.0,0.0,0.0
25%,21.0,0.0,0.0,7.8958
50%,28.0,0.0,0.0,14.4542
75%,39.0,1.0,0.0,31.275
max,80.0,8.0,9.0,512.3292


### 데이터 전처리
- null 값 처리
- 문자형 데이터(object) -> str
    - one-hot encoding
    - gen_dummies()
- https://cyc1am3n.github.io/2018/10/09/my-first-kaggle-competition_titanic.html

In [213]:
total_df.dtypes

index              int64
PassengerId        int64
Survived         float64
Sex               object
Age              float64
SibSp              int64
Parch              int64
Fare             float64
Embarked          object
Sex_code           int64
Embarked_code      int64
dtype: object

**~~Name~~, Sex, ~~Ticket~~, ~~Cabin~~, Embarked => object type**

### Sex Feature

In [27]:
total_df["Sex"] = total_df["Sex"].astype(str)

In [31]:
# 비어있는 데이터 확인
total_df.isnull().sum()

index             0
PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          0
Sex_code          0
dtype: int64

### Embarked Feature

In [29]:
# train data에서 가장 많이 있는 데이터 보기, 각 embarked별 count
train_df["Embarked"].value_counts()

S    914
C    270
Q    123
Name: Embarked, dtype: int64

In [30]:
# Embarked 비어있는 값 S로 채우기
# total_df.loc[total_df["Embarked"].isnull(), "Embarked"] = 'S'
total_df['Embarked'] = total_df['Embarked'].fillna('S')

In [32]:
# Embarked column Embarked_code로 replace
Embarked_code = {em: i for i, em in enumerate(total_df["Embarked"].unique())}
total_df["Embarked_code"] = total_df.Embarked.map(Embarked_code)
total_df.head()

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_code,Embarked_code
0,0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0,0
1,1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1,1
2,2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1,0
3,3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1,0
4,4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0,0


In [33]:
total_df['Embarked'] = total_df['Embarked'].astype(str)

### Age Feature

In [34]:
# Age가 null인 경우 train data의 평균 나이로 대체
total_df.loc[total_df["Age"].isnull(), "Age"] = train_df["Age"].mean()

In [35]:
# 나이대 별로 분리
total_df.loc[total_df['Age'] <= 16, 'Age'] = 0
total_df.loc[(total_df['Age'] > 16) & (total_df['Age'] <= 32), 'Age'] = 1
total_df.loc[(total_df['Age'] > 32) & (total_df['Age'] <= 48), 'Age'] = 2
total_df.loc[(total_df['Age'] > 48) & (total_df['Age'] <= 64), 'Age'] = 3
total_df.loc[total_df['Age'] > 64, 'Age'] = 4

In [36]:
total_df["Age"]

0       1.0
1       2.0
2       1.0
3       2.0
4       2.0
       ... 
1304    1.0
1305    2.0
1306    2.0
1307    1.0
1308    1.0
Name: Age, Length: 1309, dtype: float64

### Fare Feature

In [62]:
total_df[total_df["Fare"].isnull()]

Unnamed: 0,index,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_code,Embarked_code


In [61]:
total_df.loc[total_df["Fare"].isnull(), "Fare"] = train_df[train_df["Pclass"]==3]["Pclass"].mean()

In [63]:
# 각 column들에 null 값 존재여부 확인
total_df.isnull().sum()

index               0
PassengerId         0
Survived          418
Pclass              0
Name                0
Sex                 0
Age                 0
SibSp               0
Parch               0
Ticket              0
Fare                0
Cabin            1014
Embarked            0
Sex_code            0
Embarked_code       0
dtype: int64

### Feature 선택

In [70]:
# 몇 몇 feature drop
drop_features = ['Name', 'Ticket', 'Cabin', 'Sex_code', 'Pclass']
total_df.drop(drop_features, axis=1, inplace=True)

In [71]:
total_df.dtypes

index              int64
PassengerId        int64
Survived         float64
Sex               object
Age              float64
SibSp              int64
Parch              int64
Fare             float64
Embarked          object
Embarked_code      int64
dtype: object

### train, test data 분리

In [73]:
# index를 이용해, train_df, test_df 분리
train_df = total_df.iloc[train_index]

In [74]:
train_df.isnull().sum()

index            0
PassengerId      0
Survived         0
Sex              0
Age              0
SibSp            0
Parch            0
Fare             0
Embarked         0
Embarked_code    0
dtype: int64

In [75]:
# index를 이용해, train_df, test_df 분리
test_df = total_df.iloc[test_index]

In [76]:
total_df.isnull().sum()

index              0
PassengerId        0
Survived         418
Sex                0
Age                0
SibSp              0
Parch              0
Fare               0
Embarked           0
Embarked_code      0
dtype: int64

In [77]:
# One-hot-encoding for categorical variables
train_df = pd.get_dummies(train_df)
test_df = pd.get_dummies(test_df)

수치형 데이터로만 변환을 하게 되면 서로 간의 관계성이 생기게 된다.
그러나 실제 데이터인 월요일, 화요일, 수요일 간에는 그러한 관계성이 없다!
따라서, 사실이 아닌 관계성으로 인해 잘못된 학습이 일어날 수 있으므로
서로 무관한 수, 즉 더미로 만든 가변수로 변환함으로서 그러한 문제를 막아준다!

판다스(pandas)에서는 손쉽게 더미의 가변수를 만들 수 있도록 get_dummies함수를 제공하고 있다.

출처: https://devuna.tistory.com/67 [튜나 개발일기📚]

In [86]:
from sklearn.linear_model import LinearRegression, LogisticRegression

In [79]:
train_label = train_df['Survived']

In [80]:
train_data = train_df.drop('Survived', axis=1)

In [81]:
test_data = test_df.drop('Survived', axis=1)

In [82]:
def train_and_test(model):
    model.fit(train_data, train_label)
    prediction = model.predict(test_data)
    accuracy = round(model.score(train_data, train_label) * 100, 2)
    print("Accuracy : ", accuracy, "%")
    return prediction

In [89]:
# Logistic Regression
log_pred = train_and_test(LogisticRegression())

Accuracy :  80.14 %


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
