# MLlib
- Preparing data for modeling with MLlib
- Performing statistical tests
- Predicting infant survival with logistic regression
- Selecting features and training a random forest model

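## Package overview
Working with MLlib involves three parts:
- Data preparation: feature extraction, transformation, selection, hashing of categorical features, and some natural language processing methods
- Machine learning algorithms: several popular algorithms are implemented
- Utilities: statistical methods such as descriptive statistics, chi-square testing, linear algebra, and model evaluation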

## Loading and transforming the data
MLlib was designed around RDDs and DStreams, but the data is first loaded into a DataFrame because transformations are easier to express there

In [1]:
import pyspark.sql.types as typ
labels = [
    ('INFANT_ALIVE_AT_REPORT', typ.StringType()),
    ('BIRTH_YEAR', typ.IntegerType()),
    ('BIRTH_MONTH', typ.IntegerType()),
    ('BIRTH_PLACE', typ.StringType()),
    ('MOTHER_AGE_YEARS', typ.IntegerType()),
    ('MOTHER_RACE_6CODE', typ.StringType()),
    ('MOTHER_EDUCATION', typ.StringType()),
    ('FATHER_COMBINED_AGE', typ.IntegerType()),
    ('FATHER_EDUCATION', typ.StringType()),
    ('MONTH_PRECARE_RECODE', typ.StringType()),
    ('CIG_BEFORE', typ.IntegerType()),
    ('CIG_1_TRI', typ.IntegerType()),
    ('CIG_2_TRI', typ.IntegerType()),
    ('CIG_3_TRI', typ.IntegerType()),
    ('MOTHER_HEIGHT_IN', typ.IntegerType()),
    ('MOTHER_BMI_RECODE', typ.IntegerType()),
    ('MOTHER_PRE_WEIGHT', typ.IntegerType()),
    ('MOTHER_DELIVERY_WEIGHT', typ.IntegerType()),
    ('MOTHER_WEIGHT_GAIN', typ.IntegerType()),
    ('DIABETES_PRE', typ.StringType()),
    ('DIABETES_GEST', typ.StringType()),
    ('HYP_TENS_PRE', typ.StringType()),
    ('HYP_TENS_GEST', typ.StringType()),
    ('PREV_BIRTH_PRETERM', typ.StringType()),
    ('NO_RISK', typ.StringType()),
    ('NO_INFECTIONS_REPORTED', typ.StringType()),
    ('LABOR_IND', typ.StringType()),
    ('LABOR_AUGM', typ.StringType()),
    ('STEROIDS', typ.StringType()),
    ('ANTIBIOTICS', typ.StringType()),
    ('ANESTHESIA', typ.StringType()),
    ('DELIV_METHOD_RECODE_COMB', typ.StringType()),
    ('ATTENDANT_BIRTH', typ.StringType()),
    ('APGAR_5', typ.IntegerType()),
    ('APGAR_5_RECODE', typ.StringType()),
    ('APGAR_10', typ.IntegerType()),
    ('APGAR_10_RECODE', typ.StringType()),
    ('INFANT_SEX', typ.StringType()),
    ('OBSTETRIC_GESTATION_WEEKS', typ.IntegerType()),
    ('INFANT_WEIGHT_GRAMS', typ.IntegerType()),
    ('INFANT_ASSIST_VENTI', typ.StringType()),
    ('INFANT_ASSIST_VENTI_6HRS', typ.StringType()),
    ('INFANT_NICU_ADMISSION', typ.StringType()),
    ('INFANT_SURFACANT', typ.StringType()),
    ('INFANT_ANTIBIOTICS', typ.StringType()),
    ('INFANT_SEIZURES', typ.StringType()),
    ('INFANT_NO_ABNORMALITIES', typ.StringType()),
    ('INFANT_ANCEPHALY', typ.StringType()),
    ('INFANT_MENINGOMYELOCELE', typ.StringType()),
    ('INFANT_LIMB_REDUCTION', typ.StringType()),
    ('INFANT_DOWN_SYNDROME', typ.StringType()),
    ('INFANT_SUSPECTED_CHROMOSOMAL_DISORDER', typ.StringType()),
    ('INFANT_NO_CONGENITAL_ANOMALIES_CHECKED', typ.StringType()),
    ('INFANT_BREASTFED', typ.StringType())
]

schema = typ.StructType([
    typ.StructField(e[0], e[1], False) for e in labels
])

In [2]:
births = spark.read.csv('births_train.csv.gz', header=True, schema=schema)
births.printSchema()

root
 |-- INFANT_ALIVE_AT_REPORT: string (nullable = true)
 |-- BIRTH_YEAR: integer (nullable = true)
 |-- BIRTH_MONTH: integer (nullable = true)
 |-- BIRTH_PLACE: string (nullable = true)
 |-- MOTHER_AGE_YEARS: integer (nullable = true)
 |-- MOTHER_RACE_6CODE: string (nullable = true)
 |-- MOTHER_EDUCATION: string (nullable = true)
 |-- FATHER_COMBINED_AGE: integer (nullable = true)
 |-- FATHER_EDUCATION: string (nullable = true)
 |-- MONTH_PRECARE_RECODE: string (nullable = true)
 |-- CIG_BEFORE: integer (nullable = true)
 |-- CIG_1_TRI: integer (nullable = true)
 |-- CIG_2_TRI: integer (nullable = true)
 |-- CIG_3_TRI: integer (nullable = true)
 |-- MOTHER_HEIGHT_IN: integer (nullable = true)
 |-- MOTHER_BMI_RECODE: integer (nullable = true)
 |-- MOTHER_PRE_WEIGHT: integer (nullable = true)
 |-- MOTHER_DELIVERY_WEIGHT: integer (nullable = true)
 |-- MOTHER_WEIGHT_GAIN: integer (nullable = true)
 |-- DIABETES_PRE: string (nullable = true)
 |-- DIABETES_GEST: string (nullable = true)


### Checking the levels of the dependent variable

In [3]:
births.groupby('INFANT_ALIVE_AT_REPORT').count().show()

+----------------------+-----+
|INFANT_ALIVE_AT_REPORT|count|
+----------------------+-----+
|                     Y|23349|
|                     N|22080|
+----------------------+-----+



The two classes of the dependent variable appear to be relatively balanced.
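
The balance can be quantified directly from the counts printed above. This is a minimal plain-Python sketch (no Spark needed); the only inputs are the two counts copied from the `groupby` output.

```python
# Counts copied from the groupby('INFANT_ALIVE_AT_REPORT') output above.
counts = {'Y': 23349, 'N': 22080}

total = sum(counts.values())

# Share of each class in the data set.
shares = {level: n / total for level, n in counts.items()}

for level, share in shares.items():
    print('{0}: {1:.1%}'.format(level, share))
```

A split this close to 50/50 means that no resampling or class weighting is strictly required before training.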

To recode the levels of the categorical data as numbers, create a dictionary

In [4]:
recode_dictionary = {
    'YNU' : {
        'Y' : 1,
        'N' : 0,
        'U' : 0
    }
}

Select only the variables judged to be relevant.

In [5]:
selected_features = [
    'INFANT_ALIVE_AT_REPORT', 
    'BIRTH_PLACE', 
    'MOTHER_AGE_YEARS', 
    'FATHER_COMBINED_AGE', 
    'CIG_BEFORE', 
    'CIG_1_TRI', 
    'CIG_2_TRI', 
    'CIG_3_TRI', 
    'MOTHER_HEIGHT_IN', 
    'MOTHER_PRE_WEIGHT', 
    'MOTHER_DELIVERY_WEIGHT', 
    'MOTHER_WEIGHT_GAIN', 
    'DIABETES_PRE', 
    'DIABETES_GEST', 
    'HYP_TENS_PRE', 
    'HYP_TENS_GEST', 
    'PREV_BIRTH_PRETERM'
]

births_trimmed = births.select(selected_features)
births_trimmed.printSchema()

root
 |-- INFANT_ALIVE_AT_REPORT: string (nullable = true)
 |-- BIRTH_PLACE: string (nullable = true)
 |-- MOTHER_AGE_YEARS: integer (nullable = true)
 |-- FATHER_COMBINED_AGE: integer (nullable = true)
 |-- CIG_BEFORE: integer (nullable = true)
 |-- CIG_1_TRI: integer (nullable = true)
 |-- CIG_2_TRI: integer (nullable = true)
 |-- CIG_3_TRI: integer (nullable = true)
 |-- MOTHER_HEIGHT_IN: integer (nullable = true)
 |-- MOTHER_PRE_WEIGHT: integer (nullable = true)
 |-- MOTHER_DELIVERY_WEIGHT: integer (nullable = true)
 |-- MOTHER_WEIGHT_GAIN: integer (nullable = true)
 |-- DIABETES_PRE: string (nullable = true)
 |-- DIABETES_GEST: string (nullable = true)
 |-- HYP_TENS_PRE: string (nullable = true)
 |-- HYP_TENS_GEST: string (nullable = true)
 |-- PREV_BIRTH_PRETERM: string (nullable = true)



In [6]:
births_trimmed.take(3)

[Row(INFANT_ALIVE_AT_REPORT='N', BIRTH_PLACE='1', MOTHER_AGE_YEARS=29, FATHER_COMBINED_AGE=99, CIG_BEFORE=99, CIG_1_TRI=99, CIG_2_TRI=99, CIG_3_TRI=99, MOTHER_HEIGHT_IN=99, MOTHER_PRE_WEIGHT=999, MOTHER_DELIVERY_WEIGHT=999, MOTHER_WEIGHT_GAIN=99, DIABETES_PRE='N', DIABETES_GEST='N', HYP_TENS_PRE='N', HYP_TENS_GEST='N', PREV_BIRTH_PRETERM='N'),
 Row(INFANT_ALIVE_AT_REPORT='N', BIRTH_PLACE='1', MOTHER_AGE_YEARS=22, FATHER_COMBINED_AGE=29, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=65, MOTHER_PRE_WEIGHT=180, MOTHER_DELIVERY_WEIGHT=198, MOTHER_WEIGHT_GAIN=18, DIABETES_PRE='N', DIABETES_GEST='N', HYP_TENS_PRE='N', HYP_TENS_GEST='N', PREV_BIRTH_PRETERM='N'),
 Row(INFANT_ALIVE_AT_REPORT='N', BIRTH_PLACE='1', MOTHER_AGE_YEARS=38, FATHER_COMBINED_AGE=40, CIG_BEFORE=0, CIG_1_TRI=0, CIG_2_TRI=0, CIG_3_TRI=0, MOTHER_HEIGHT_IN=63, MOTHER_PRE_WEIGHT=155, MOTHER_DELIVERY_WEIGHT=167, MOTHER_WEIGHT_GAIN=12, DIABETES_PRE='N', DIABETES_GEST='N', HYP_TENS_PRE='N', HYP_TENS_GEST=

In [7]:
births_trimmed.groupby('CIG_BEFORE').count().show()

+----------+-----+
|CIG_BEFORE|count|
+----------+-----+
|        31|    1|
|        28|    1|
|        44|    1|
|        12|   51|
|        22|    3|
|         1|  165|
|        13|    3|
|         6|  171|
|        16|    3|
|         3|  244|
|        20| 1509|
|        40|  127|
|         5|  504|
|        19|    1|
|        15|  122|
|         9|   12|
|        35|    1|
|         4|  190|
|         8|   88|
|        23|    2|
+----------+-----+
only showing top 20 rows



In [8]:
births_trimmed.groupby('CIG_3_TRI').count().show()

+---------+-----+
|CIG_3_TRI|count|
+---------+-----+
|       12|   20|
|       22|    1|
|        1|  147|
|       13|    2|
|        6|  112|
|        3|  211|
|       20|  411|
|       40|   17|
|        5|  438|
|       19|    1|
|       15|   47|
|        9|    9|
|        4|  139|
|        8|   40|
|        7|   73|
|       10|  910|
|       80|    2|
|       25|    2|
|       21|    1|
|       98|    9|
+---------+-----+
only showing top 20 rows



In [9]:
import pyspark.sql.functions as func

# Recode the levels of a categorical variable to integers
def recode(col, key):
    return recode_dictionary[key][col]

# Clean the cigarette-count features:
# the value 99 is replaced with 0; every other value is kept as-is
def correct_cig(feat):
    return func.when(func.col(feat) != 99, func.col(feat)).otherwise(0)

# Wrap recode as a UDF and declare its return type so that the DataFrame can use it
rec_integer = func.udf(recode, typ.IntegerType())
rec_integer # the UDF is an ordinary object; evaluating it shows the wrapped function

<function __main__.recode>
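
What `recode` and `correct_cig` do to individual values can be checked without Spark. The sketch below mirrors their logic on plain Python values; `recode_value`, `correct_cig_value` and the sample `record` are hypothetical names introduced only for this illustration.

```python
recode_dictionary = {'YNU': {'Y': 1, 'N': 0, 'U': 0}}

def recode_value(value, key):
    """Look up a categorical level in the recoding dictionary."""
    return recode_dictionary[key][value]

def correct_cig_value(value):
    """Replace the value 99 with 0; keep every other count unchanged."""
    return value if value != 99 else 0

# A hypothetical record shaped like the rows printed earlier.
record = {'DIABETES_PRE': 'U', 'PREV_BIRTH_PRETERM': 'Y',
          'CIG_BEFORE': 99, 'CIG_1_TRI': 20}

cleaned = {
    'DIABETES_PRE': recode_value(record['DIABETES_PRE'], 'YNU'),
    'PREV_BIRTH_PRETERM': recode_value(record['PREV_BIRTH_PRETERM'], 'YNU'),
    'CIG_BEFORE': correct_cig_value(record['CIG_BEFORE']),
    'CIG_1_TRI': correct_cig_value(record['CIG_1_TRI']),
}
print(cleaned)
```

Note that 'U' (unknown) is mapped to the same value as 'N', so the recoded column no longer distinguishes the two.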

In [10]:
births_transformed = births_trimmed.withColumn('CIG_BEFORE', correct_cig('CIG_BEFORE')).withColumn('CIG_1_TRI', correct_cig('CIG_1_TRI')).withColumn('CIG_2_TRI', correct_cig('CIG_2_TRI')).withColumn('CIG_3_TRI', correct_cig('CIG_3_TRI'))

### Binning

In [11]:
cols = [(col.name, col.dataType) for col in births_trimmed.schema]
cols

[('INFANT_ALIVE_AT_REPORT', StringType),
 ('BIRTH_PLACE', StringType),
 ('MOTHER_AGE_YEARS', IntegerType),
 ('FATHER_COMBINED_AGE', IntegerType),
 ('CIG_BEFORE', IntegerType),
 ('CIG_1_TRI', IntegerType),
 ('CIG_2_TRI', IntegerType),
 ('CIG_3_TRI', IntegerType),
 ('MOTHER_HEIGHT_IN', IntegerType),
 ('MOTHER_PRE_WEIGHT', IntegerType),
 ('MOTHER_DELIVERY_WEIGHT', IntegerType),
 ('MOTHER_WEIGHT_GAIN', IntegerType),
 ('DIABETES_PRE', StringType),
 ('DIABETES_GEST', StringType),
 ('HYP_TENS_PRE', StringType),
 ('HYP_TENS_GEST', StringType),
 ('PREV_BIRTH_PRETERM', StringType)]

In [12]:
YNU_cols = []

# Find the string columns that hold Y/N/U values (detected by the presence of 'Y')
for i, s in enumerate(cols):
    if s[1] == typ.StringType():
        dis = births.select(s[0]).distinct().rdd.map(lambda row: row[0]).collect()
        if 'Y' in dis:
            YNU_cols.append(s[0])
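
The detection rule used in the loop above can be illustrated on plain Python data. In this sketch the `columns` dictionary is a hypothetical stand-in for the DataFrame, with a handful of made-up values per column.

```python
# Hypothetical miniature data set: column name -> list of values.
columns = {
    'BIRTH_PLACE': ['1', '1', '4'],
    'MOTHER_AGE_YEARS': [29, 22, 38],
    'DIABETES_PRE': ['N', 'Y', 'U'],
    'HYP_TENS_GEST': ['N', 'N', 'Y'],
}

# A column qualifies when all its values are strings and 'Y' is among
# its distinct values, which is the same test the Spark loop applies.
ynu_cols = [
    name for name, values in columns.items()
    if all(isinstance(v, str) for v in values) and 'Y' in set(values)
]
print(ynu_cols)
```

One consequence of this rule is that a Y/N/U column in which 'Y' never occurs would not be detected.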

To transform all of the YNU_cols in one go, build a list of transformations.

In [13]:
# Practice run: take() returns a list of Row objects holding the original and the recoded value
births.select([
    'INFANT_NICU_ADMISSION', rec_integer('INFANT_NICU_ADMISSION', func.lit('YNU')).alias('INFANT_NICU_ADMISSION_RECODE')]).take(5)

[Row(INFANT_NICU_ADMISSION='Y', INFANT_NICU_ADMISSION_RECODE=1),
 Row(INFANT_NICU_ADMISSION='Y', INFANT_NICU_ADMISSION_RECODE=1),
 Row(INFANT_NICU_ADMISSION='U', INFANT_NICU_ADMISSION_RECODE=0),
 Row(INFANT_NICU_ADMISSION='N', INFANT_NICU_ADMISSION_RECODE=0),
 Row(INFANT_NICU_ADMISSION='U', INFANT_NICU_ADMISSION_RECODE=0)]

In [14]:
exprs_YNU = [rec_integer(x, func.lit('YNU')).alias(x) if x in YNU_cols else x for x in births_transformed.columns ]
exprs_YNU # the Y/N/U columns are now recode(...) Column expressions; the other columns stay as plain names

[Column<b'recode(INFANT_ALIVE_AT_REPORT, YNU) AS `INFANT_ALIVE_AT_REPORT`'>,
 'BIRTH_PLACE',
 'MOTHER_AGE_YEARS',
 'FATHER_COMBINED_AGE',
 'CIG_BEFORE',
 'CIG_1_TRI',
 'CIG_2_TRI',
 'CIG_3_TRI',
 'MOTHER_HEIGHT_IN',
 'MOTHER_PRE_WEIGHT',
 'MOTHER_DELIVERY_WEIGHT',
 'MOTHER_WEIGHT_GAIN',
 Column<b'recode(DIABETES_PRE, YNU) AS `DIABETES_PRE`'>,
 Column<b'recode(DIABETES_GEST, YNU) AS `DIABETES_GEST`'>,
 Column<b'recode(HYP_TENS_PRE, YNU) AS `HYP_TENS_PRE`'>,
 Column<b'recode(HYP_TENS_GEST, YNU) AS `HYP_TENS_GEST`'>,
 Column<b'recode(PREV_BIRTH_PRETERM, YNU) AS `PREV_BIRTH_PRETERM`'>]

In [15]:
births_transformed.select(exprs_YNU)

DataFrame[INFANT_ALIVE_AT_REPORT: int, BIRTH_PLACE: string, MOTHER_AGE_YEARS: int, FATHER_COMBINED_AGE: int, CIG_BEFORE: int, CIG_1_TRI: int, CIG_2_TRI: int, CIG_3_TRI: int, MOTHER_HEIGHT_IN: int, MOTHER_PRE_WEIGHT: int, MOTHER_DELIVERY_WEIGHT: int, MOTHER_WEIGHT_GAIN: int, DIABETES_PRE: int, DIABETES_GEST: int, HYP_TENS_PRE: int, HYP_TENS_GEST: int, PREV_BIRTH_PRETERM: int]

In [16]:
births_transformed = births_transformed.select(exprs_YNU)
births_transformed.select(YNU_cols[-5:]).show(5)

+------------+-------------+------------+-------------+------------------+
|DIABETES_PRE|DIABETES_GEST|HYP_TENS_PRE|HYP_TENS_GEST|PREV_BIRTH_PRETERM|
+------------+-------------+------------+-------------+------------------+
|           0|            0|           0|            0|                 0|
|           0|            0|           0|            0|                 0|
|           0|            0|           0|            0|                 0|
|           0|            0|           0|            0|                 1|
|           0|            0|           0|            0|                 0|
+------------+-------------+------------+-------------+------------------+
only showing top 5 rows



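## EDA and summary statistics

### Descriptive statistics
 - Extract the continuous variables and check their descriptive statistics
 - Check the counts per level of the categorical variables

### Descriptive statistics of the continuous variables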

In [17]:
births_trimmed.printSchema()

root
 |-- INFANT_ALIVE_AT_REPORT: string (nullable = true)
 |-- BIRTH_PLACE: string (nullable = true)
 |-- MOTHER_AGE_YEARS: integer (nullable = true)
 |-- FATHER_COMBINED_AGE: integer (nullable = true)
 |-- CIG_BEFORE: integer (nullable = true)
 |-- CIG_1_TRI: integer (nullable = true)
 |-- CIG_2_TRI: integer (nullable = true)
 |-- CIG_3_TRI: integer (nullable = true)
 |-- MOTHER_HEIGHT_IN: integer (nullable = true)
 |-- MOTHER_PRE_WEIGHT: integer (nullable = true)
 |-- MOTHER_DELIVERY_WEIGHT: integer (nullable = true)
 |-- MOTHER_WEIGHT_GAIN: integer (nullable = true)
 |-- DIABETES_PRE: string (nullable = true)
 |-- DIABETES_GEST: string (nullable = true)
 |-- HYP_TENS_PRE: string (nullable = true)
 |-- HYP_TENS_GEST: string (nullable = true)
 |-- PREV_BIRTH_PRETERM: string (nullable = true)



In [18]:
import pyspark.mllib.stat as st
import numpy as np

numeric_cols = ['MOTHER_AGE_YEARS','FATHER_COMBINED_AGE',
                'CIG_BEFORE','CIG_1_TRI','CIG_2_TRI','CIG_3_TRI',
                'MOTHER_HEIGHT_IN','MOTHER_PRE_WEIGHT',
                'MOTHER_DELIVERY_WEIGHT','MOTHER_WEIGHT_GAIN'
               ]

# Convert each Row to a plain list so that the RDD holds one list of values per record
numeric_rdd = births_transformed.select(numeric_cols).rdd.map(lambda row: [e for e in row])
numeric_rdd.take(3)

[[29, 99, 0, 0, 0, 0, 99, 999, 999, 99],
 [22, 29, 0, 0, 0, 0, 65, 180, 198, 18],
 [38, 40, 0, 0, 0, 0, 63, 155, 167, 12]]

In [19]:
mllib_stat = st.Statistics.colStats(numeric_rdd) # MultivariateStatisticalSummary

# Print the mean and standard deviation of each variable
for col, m, v in zip(numeric_cols, mllib_stat.mean(), mllib_stat.variance()):
    print('{0}: \t {1:.2f} \t {2:.2f}'.format(col, m, np.sqrt(v)))

MOTHER_AGE_YEARS: 	 28.30 	 6.08
FATHER_COMBINED_AGE: 	 44.55 	 27.55
CIG_BEFORE: 	 1.43 	 5.18
CIG_1_TRI: 	 0.91 	 3.83
CIG_2_TRI: 	 0.70 	 3.31
CIG_3_TRI: 	 0.58 	 3.11
MOTHER_HEIGHT_IN: 	 65.12 	 6.45
MOTHER_PRE_WEIGHT: 	 214.50 	 210.21
MOTHER_DELIVERY_WEIGHT: 	 223.63 	 180.01
MOTHER_WEIGHT_GAIN: 	 30.74 	 26.23
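
Some of these statistics look implausibly large, for example MOTHER_PRE_WEIGHT with a mean of 214.50 and a standard deviation of 210.21. One possible explanation, assuming that 999 is a placeholder for an unknown weight (the first row printed earlier suggests this, but the notes do not confirm it), is that such placeholders are included in the computation. The sketch below uses only the three MOTHER_PRE_WEIGHT values shown in the `take(3)` output.

```python
import statistics

# MOTHER_PRE_WEIGHT values from the three records printed above.
weights = [999, 180, 155]

mean_all = statistics.mean(weights)
std_all = statistics.stdev(weights)  # sample standard deviation (n - 1)

# Assumption: 999 marks an unknown weight and should be left out.
known = [w for w in weights if w != 999]
mean_known = statistics.mean(known)

print('with 999:    mean={0:.2f} std={1:.2f}'.format(mean_all, std_all))
print('without 999: mean={0:.2f}'.format(mean_known))
```

If the assumption holds, filtering or imputing these values before calling `colStats` would give more meaningful summaries.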


### Checking the frequency distribution (skew) of the categorical variables

In [20]:
categorical_cols = [e for e in births_transformed.columns if e not in numeric_cols]

categorical_rdd = births_transformed.select(categorical_cols).rdd.map(lambda row: [e for e in row])

for i, col in enumerate(categorical_cols):
    agg = categorical_rdd.groupBy(lambda row: row[i]).map(lambda row: (row[0], len(row[1])))
    print(col, sorted(agg.collect(), key=lambda el: el[1], reverse=True))

INFANT_ALIVE_AT_REPORT [(1, 23349), (0, 22080)]
BIRTH_PLACE [('1', 44558), ('4', 327), ('3', 224), ('2', 136), ('7', 91), ('5', 74), ('6', 11), ('9', 8)]
DIABETES_PRE [(0, 44881), (1, 548)]
DIABETES_GEST [(0, 43451), (1, 1978)]
HYP_TENS_PRE [(0, 44348), (1, 1081)]
HYP_TENS_GEST [(0, 43302), (1, 2127)]
PREV_BIRTH_PRETERM [(0, 43088), (1, 2341)]


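- The results show that most deliveries took place in a hospital (BIRTH_PLACE equal to 1), as expected
- About 550 deliveries took place at home: some intentionally (BIRTH_PLACE equal to 3) and some not (BIRTH_PLACE equal to 4)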

### Checking correlations between the continuous variables

In [21]:
# Runs very quickly
corr = st.Statistics.corr(numeric_rdd) # returned as a NumPy ndarray

for i, el in enumerate(corr > 0.5):
    correlated = [(numeric_cols[j], corr[i][j]) for j, e in enumerate(el) if e == 1.0 and j!=i]
    
    if len(correlated) > 0:
        for e in correlated:
            print('{0}-to-{1} : {2:.2f}'.format(numeric_cols[i], e[0], e[1]))

CIG_BEFORE-to-CIG_1_TRI : 0.83
CIG_BEFORE-to-CIG_2_TRI : 0.72
CIG_BEFORE-to-CIG_3_TRI : 0.62
CIG_1_TRI-to-CIG_BEFORE : 0.83
CIG_1_TRI-to-CIG_2_TRI : 0.87
CIG_1_TRI-to-CIG_3_TRI : 0.76
CIG_2_TRI-to-CIG_BEFORE : 0.72
CIG_2_TRI-to-CIG_1_TRI : 0.87
CIG_2_TRI-to-CIG_3_TRI : 0.89
CIG_3_TRI-to-CIG_BEFORE : 0.62
CIG_3_TRI-to-CIG_1_TRI : 0.76
CIG_3_TRI-to-CIG_2_TRI : 0.89
MOTHER_PRE_WEIGHT-to-MOTHER_DELIVERY_WEIGHT : 0.54
MOTHER_PRE_WEIGHT-to-MOTHER_WEIGHT_GAIN : 0.65
MOTHER_DELIVERY_WEIGHT-to-MOTHER_PRE_WEIGHT : 0.54
MOTHER_DELIVERY_WEIGHT-to-MOTHER_WEIGHT_GAIN : 0.60
MOTHER_WEIGHT_GAIN-to-MOTHER_PRE_WEIGHT : 0.65
MOTHER_WEIGHT_GAIN-to-MOTHER_DELIVERY_WEIGHT : 0.60
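
The same idea, computing Pearson correlations and keeping the pairs above 0.5, can be sketched in plain Python. The toy columns below are hypothetical values chosen for illustration, not taken from the births data, and each pair is reported once rather than in both directions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical toy columns.
data = {
    'CIG_BEFORE': [0, 0, 10, 20, 40],
    'CIG_1_TRI': [0, 0, 5, 20, 30],
    'MOTHER_AGE_YEARS': [29, 22, 38, 25, 31],
}

names = list(data)
# Upper triangle only, so each pair appears once.
pairs = [(a, b, pearson(data[a], data[b]))
         for i, a in enumerate(names) for b in names[i + 1:]]
strong = [(a, b, r) for a, b, r in pairs if r > 0.5]

for a, b, r in strong:
    print('{0}-to-{1} : {2:.2f}'.format(a, b, r))
```

Filtering on `r > 0.5` ignores strong negative correlations; comparing `abs(r)` against the threshold would catch those as well.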


### feature selection

The correlations show that the CIG_* features are highly correlated with one another, as are the weight features, so only CIG_1_TRI and MOTHER_PRE_WEIGHT are kept from those two groups.

In [22]:
feature_to_keep = [
    'INFANT_ALIVE_AT_REPORT', 
    'BIRTH_PLACE', 
    'MOTHER_AGE_YEARS', 
    'FATHER_COMBINED_AGE', 
    'CIG_1_TRI', 
    'MOTHER_HEIGHT_IN', 
    'MOTHER_PRE_WEIGHT', 
    'DIABETES_PRE', 
    'DIABETES_GEST', 
    'HYP_TENS_PRE', 
    'HYP_TENS_GEST', 
    'PREV_BIRTH_PRETERM'
]

births_transformed = births_transformed.select(feature_to_keep)

In [23]:
births_transformed.show(5)

+----------------------+-----------+----------------+-------------------+---------+----------------+-----------------+------------+-------------+------------+-------------+------------------+
|INFANT_ALIVE_AT_REPORT|BIRTH_PLACE|MOTHER_AGE_YEARS|FATHER_COMBINED_AGE|CIG_1_TRI|MOTHER_HEIGHT_IN|MOTHER_PRE_WEIGHT|DIABETES_PRE|DIABETES_GEST|HYP_TENS_PRE|HYP_TENS_GEST|PREV_BIRTH_PRETERM|
+----------------------+-----------+----------------+-------------------+---------+----------------+-----------------+------------+-------------+------------+-------------+------------------+
|                     0|          1|              29|                 99|        0|              99|              999|           0|            0|           0|            0|                 0|
|                     0|          1|              22|                 29|        0|              65|              180|           0|            0|           0|            0|                 0|
|                     0|          1|    

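### Statistical testing of the categorical variables
- Chi-square test: a test of independence that checks whether each categorical feature is associated with the target variable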

In [24]:
import pyspark.mllib.linalg as ln

for cat in categorical_cols[1:]:
    agg = births_transformed.groupby('INFANT_ALIVE_AT_REPORT').pivot(cat).count() # DataFrame[INFANT_ALIVE_AT_REPORT: int, 1: bigint, 2: bigint, 3: bigint, 4: bigint, 5: bigint, 6: bigint, 7: bigint, 9: bigint]
    agg_rdd = agg.rdd.map(lambda row : (row[1:])).flatMap(lambda row: [0 if e == None else e for e in row]).collect() # [22995, 113, 158, 39, 19, 2, 23, 0, 21563, 23, 66, 288, 55, 9, 68, 8]
    
    row_length = len(agg.collect()[0]) - 1 # number of levels of the categorical variable
    
    # 2 is the number of levels of the target variable
    agg = ln.Matrices.dense(row_length, 2, agg_rdd) # DenseMatrix(8, 2, [22995.0, 113.0, 158.0, 39.0, 19.0, 2.0, 23.0, 0.0, 21563.0, 23.0, 66.0, 288.0, 55.0, 9.0, 68.0, 8.0], False)
    
    test = st.Statistics.chiSqTest(agg)
    print(cat, round(test.pValue, 4))

BIRTH_PLACE 0.0
DIABETES_PRE 0.0
DIABETES_GEST 0.0
HYP_TENS_PRE 0.0
HYP_TENS_GEST 0.0
PREV_BIRTH_PRETERM 0.0


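Every p-value rounds to 0, so the hypothesis of independence is rejected for each categorical variable: all of them are significantly associated with the target and should help predict infant survival.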
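
The chi-square statistic behind the BIRTH_PLACE result can be recomputed by hand. This sketch uses the contingency counts quoted in the comments of the cell above (one list per level of INFANT_ALIVE_AT_REPORT) and compares the statistic with 14.067, the standard critical value for 7 degrees of freedom at the 5% level. It computes only the statistic, not the p-value.

```python
# Contingency counts for BIRTH_PLACE, copied from the comments above.
observed = [
    [22995, 113, 158, 39, 19, 2, 23, 0],
    [21563, 23, 66, 288, 55, 9, 68, 8],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(col_totals) - 1)
critical_5pct = 14.067  # chi-square critical value for 7 degrees of freedom

print('chi2 = {0:.1f}, dof = {1}'.format(chi2, dof))
print('reject independence:', chi2 > critical_5pct)
```

A statistic far above the critical value means that birth place and survival are not independent in this data.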

## Creating the final dataset
LabeledPoint(label, features)

### Creating an RDD of LabeledPoint
- Encode the BIRTH_PLACE variable with the hashing trick

In [25]:
births_transformed.printSchema()

root
 |-- INFANT_ALIVE_AT_REPORT: integer (nullable = true)
 |-- BIRTH_PLACE: string (nullable = true)
 |-- MOTHER_AGE_YEARS: integer (nullable = true)
 |-- FATHER_COMBINED_AGE: integer (nullable = true)
 |-- CIG_1_TRI: integer (nullable = true)
 |-- MOTHER_HEIGHT_IN: integer (nullable = true)
 |-- MOTHER_PRE_WEIGHT: integer (nullable = true)
 |-- DIABETES_PRE: integer (nullable = true)
 |-- DIABETES_GEST: integer (nullable = true)
 |-- HYP_TENS_PRE: integer (nullable = true)
 |-- HYP_TENS_GEST: integer (nullable = true)
 |-- PREV_BIRTH_PRETERM: integer (nullable = true)



In [26]:
births_transformed.show(2)

+----------------------+-----------+----------------+-------------------+---------+----------------+-----------------+------------+-------------+------------+-------------+------------------+
|INFANT_ALIVE_AT_REPORT|BIRTH_PLACE|MOTHER_AGE_YEARS|FATHER_COMBINED_AGE|CIG_1_TRI|MOTHER_HEIGHT_IN|MOTHER_PRE_WEIGHT|DIABETES_PRE|DIABETES_GEST|HYP_TENS_PRE|HYP_TENS_GEST|PREV_BIRTH_PRETERM|
+----------------------+-----------+----------------+-------------------+---------+----------------+-----------------+------------+-------------+------------+-------------+------------------+
|                     0|          1|              29|                 99|        0|              99|              999|           0|            0|           0|            0|                 0|
|                     0|          1|              22|                 29|        0|              65|              180|           0|            0|           0|            0|                 0|
+----------------------+-----------+----

In [27]:
import pyspark.mllib.feature as ft
import pyspark.mllib.regression as reg

hashing = ft.HashingTF(7) # transform() returns a SparseVector with 7 features

births_hashed = births_transformed.rdd.map(lambda row: [
    list(hashing.transform(row[1]).toArray()) 
        if col == 'BIRTH_PLACE' 
        else row[i]
    for i, col in enumerate(feature_to_keep)]
).map(lambda row: [[e] if type(e) == int else e for e in row]
).map(lambda row: [item for sublist in row for item in sublist]
).map(lambda row: reg.LabeledPoint(row[0], ln.Vectors.dense(row[1:])))

In [28]:
births_hashed.take(2) # BIRTH_PLACE has 8 levels -> hashed into 7 buckets

[LabeledPoint(0.0, [1.0,0.0,0.0,0.0,0.0,0.0,0.0,29.0,99.0,0.0,99.0,999.0,0.0,0.0,0.0,0.0,0.0]),
 LabeledPoint(0.0, [1.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0,29.0,0.0,65.0,180.0,0.0,0.0,0.0,0.0,0.0])]
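
The hashing trick itself is easy to sketch without Spark. The example below hashes each category into one of 7 buckets and one-hot encodes the bucket. It uses `zlib.crc32` as a stable hash purely for illustration; this is an assumption of the sketch and not the hash function that `HashingTF` uses, so bucket positions will differ from the Spark output.

```python
import zlib

def hash_one_hot(level, num_buckets=7):
    """One-hot encode a category by hashing it into a fixed number of buckets."""
    bucket = zlib.crc32(level.encode('utf-8')) % num_buckets
    vector = [0.0] * num_buckets
    vector[bucket] = 1.0
    return vector

# The eight BIRTH_PLACE levels seen in the data.
levels = ['1', '2', '3', '4', '5', '6', '7', '9']
buckets = {level: hash_one_hot(level).index(1.0) for level in levels}

for level, bucket in buckets.items():
    print(level, '->', bucket)
```

With 8 levels and only 7 buckets at least two levels must share a bucket, so the encoding loses some information in exchange for a fixed vector size.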

### Splitting the data (train, test)

In [29]:
birth_train, birth_test = births_hashed.randomSplit([0.6, 0.4])
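
`randomSplit` also accepts an optional `seed` argument, which makes the split reproducible between runs. The plain-Python sketch below illustrates a weighted random split under the assumption that each record is assigned to a part independently, so the part sizes are only approximately 60/40; `bernoulli_split` is a hypothetical helper, not a Spark function.

```python
import random

def bernoulli_split(items, weights, seed):
    """Assign each item independently to one part, with the given weights."""
    rng = random.Random(seed)
    total = float(sum(weights))

    # Cumulative upper bounds, e.g. [0.6, 1.0] for weights [0.6, 0.4].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    parts = [[] for _ in weights]
    for item in items:
        u = rng.random()
        for k, bound in enumerate(bounds):
            # The last part also catches any floating-point shortfall.
            if u < bound or k == len(bounds) - 1:
                parts[k].append(item)
                break
    return parts

train, test = bernoulli_split(range(1000), [0.6, 0.4], seed=42)
print(len(train), len(test))
```

Calling the function again with the same seed returns the same two parts, which is the property that matters when comparing models across runs.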

## Infant survival prediction models
- Logistic regression
- Random forest

### Logistic regression in MLlib
- Spark 2.0: LogisticRegressionWithLBFGS
> Limited-memory BFGS (Broyden-Fletcher-Goldfarb-Shanno)
> - http://aria24.com/blog/2014/12/understanding-lbfgs
> - http://darkpgmr.tistory.com/149

In [30]:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# iterations: maximum number of optimizer iterations (worth studying further)
LR_model = LogisticRegressionWithLBFGS.train(birth_train, iterations=10)

In [31]:
LR_result = (
    birth_test.map(lambda row: row.label).zip(LR_model.predict(birth_test.map(lambda row: row.features)))
).map(lambda row: (row[0], row[1] * 1.0))

In [32]:
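# LR_result.take(3) # returns tuples holding the actual label and the predicted label
# Note: BinaryClassificationMetrics documents its input as (score, label) pairs,
# whereas LR_result holds (label, prediction) tuples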

In [33]:
import pyspark.mllib.evaluation as ev

LR_evaluation = ev.BinaryClassificationMetrics(LR_result)
print("Area under PR : {0:.2f}".format(LR_evaluation.areaUnderPR))
print("Area under ROC : {0:.2f}".format(LR_evaluation.areaUnderROC))
LR_evaluation.unpersist()

Area under PR : 0.85
Area under ROC : 0.63
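
Area under the ROC curve is a ranking measure, so it matters whether the classifier supplies continuous scores or hard 0/1 predictions. The sketch below computes the area with the rank-based (Mann-Whitney) formulation on hypothetical `(score, label)` pairs; it illustrates the metric and is not the algorithm Spark uses internally.

```python
def auc_roc(score_and_labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in score_and_labels if y == 1.0]
    neg = [s for s, y in score_and_labels if y == 0.0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical (score, label) pairs.
pairs = [(0.9, 1.0), (0.8, 1.0), (0.7, 0.0),
         (0.6, 1.0), (0.3, 0.0), (0.2, 0.0)]

# The same data after thresholding the scores at 0.5.
hard_pairs = [(1.0 if s >= 0.5 else 0.0, y) for s, y in pairs]

auc_scores = auc_roc(pairs)
auc_hard = auc_roc(hard_pairs)
print('scores: {0:.3f}  hard labels: {1:.3f}'.format(auc_scores, auc_hard))
```

Thresholding discards the ordering within each predicted class, so the area computed from hard labels can understate how well the model ranks records. In pyspark.mllib, calling `clearThreshold()` on the trained logistic regression model makes `predict` return raw scores instead of 0/1 labels.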


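### Random forest model

### Feature selection
- Spark provides ChiSqSelector, which picks the most predictive features using a chi-square test
> It works on numeric feature vectors, so categorical variables must first be hashed or dummy-coded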

In [35]:
selector = ft.ChiSqSelector(4).fit(birth_train)
topFeatures_train = (
    birth_train.map(lambda row: row.label).zip(selector.transform(birth_train.map(lambda row: row.features)))
).map(lambda row: reg.LabeledPoint(row[0], row[1]))

In [36]:
topFeatures_test = (
    birth_test.map(lambda row: row.label).zip(selector.transform(birth_test.map(lambda row: row.features)))
).map(lambda row: reg.LabeledPoint(row[0], row[1]))
topFeatures_test.take(3)

[LabeledPoint(0.0, [0.0,39.0,42.0,60.0]),
 LabeledPoint(0.0, [0.0,22.0,25.0,68.0]),
 LabeledPoint(0.0, [0.0,39.0,66.0,65.0])]

### Random forest parameters
- **data** – Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, …, numClasses-1}.
- **numClasses** – Number of classes for classification.
- **categoricalFeaturesInfo** – Map storing arity of categorical features. An entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, …, k-1}.
- **numTrees** – Number of trees in the random forest.
- **featureSubsetStrategy** – Number of features to consider for splits at each node. Supported values: “auto”, “all”, “sqrt”, “log2”, “onethird”. If “auto” is set, this parameter is set based on numTrees: if numTrees == 1, set to “all”; if numTrees > 1 (forest) set to “sqrt”. (default: “auto”)
- **impurity** – Criterion used for information gain calculation. Supported values: “gini” or “entropy”. (default: “gini”)
- **maxDepth** – Maximum depth of tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes). (default: 4)
- **maxBins** – Maximum number of bins used for splitting features. (default: 32)
- **seed** – Random seed for bootstrapping and choosing feature subsets. Set as None to generate seed based on system time. (default: None)
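
To make the `featureSubsetStrategy` options concrete, the sketch below computes how many features would be considered at each split for a classification forest. `features_per_split` is a hypothetical helper written for this note, and rounding fractional results up is an assumption of the sketch rather than something stated in the parameter list.

```python
import math

def features_per_split(strategy, num_features, num_trees):
    """Number of candidate features per split for a classification forest."""
    if strategy == 'auto':
        # As described above: 'all' for a single tree, 'sqrt' for a forest.
        strategy = 'all' if num_trees == 1 else 'sqrt'
    if strategy == 'all':
        return num_features
    if strategy == 'sqrt':
        return int(math.ceil(math.sqrt(num_features)))
    if strategy == 'log2':
        return max(1, int(math.ceil(math.log2(num_features))))
    if strategy == 'onethird':
        return int(math.ceil(num_features / 3.0))
    raise ValueError('unsupported strategy: {0}'.format(strategy))

# Four selected features and six trees, as in the model trained in these notes.
for strategy in ['auto', 'all', 'sqrt', 'log2', 'onethird']:
    print(strategy, features_per_split(strategy, num_features=4, num_trees=6))
```

With only four features, every strategy other than 'all' restricts each split to about half of them, which is one reason to choose 'all' for such a small feature set.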

In [37]:
from pyspark.mllib.tree import RandomForest

In [38]:
# There are no categorical features, so categoricalFeaturesInfo is set to {}
RF_model = RandomForest.trainClassifier(data = topFeatures_train, 
                                        numClasses=2, categoricalFeaturesInfo={} , 
                                        numTrees=6, featureSubsetStrategy='all', seed=66)

In [39]:
RF_result = (
    topFeatures_test.map(lambda row: row.label
                        ).zip(RF_model.predict(topFeatures_test.map(lambda row: row.features)))
)

RF_evaluation = ev.BinaryClassificationMetrics(RF_result)
print("Area under PR : {0:.2f}".format(RF_evaluation.areaUnderPR))
print("Area under ROC : {0:.2f}".format(RF_evaluation.areaUnderROC))
RF_evaluation.unpersist()

Area under PR : 0.88
Area under ROC : 0.62


### Applying the same approach to logistic regression

In [40]:
LR_model_2 = LogisticRegressionWithLBFGS.train(topFeatures_train, iterations=10)
LR_result_2 = (
    topFeatures_test.map(lambda row: row.label
                        ).zip(LR_model_2.predict(topFeatures_test.map(lambda row: row.features)))
).map(lambda row: (row[0], row[1] * 1.0))

LR_evaluation_2 = ev.BinaryClassificationMetrics(LR_result_2)

print("Area under PR : {0:.2f}".format(LR_evaluation_2.areaUnderPR))
print("Area under ROC : {0:.2f}".format(LR_evaluation_2.areaUnderROC))
LR_evaluation_2.unpersist()

Area under PR : 0.89
Area under ROC : 0.62
