# Kaggle Competition - Classification with an Academic Success Dataset 3

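### Modeling with AutoML
I ran a performance test with Microsoft's AutoML model. It bundles several machine learning models, automatically tunes the hyperparameters of each one, and returns the result. Because the tedious, complicated work of tuning individual models is automated, it is easy to use. Getting better performance, however, still requires spending time on tuning AutoML itself.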
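To make the idea concrete, here is a deliberately tiny sketch of what an AutoML loop does: it tries several learner families, each with a few hyperparameter settings, and returns the best one. This is not Microsoft's implementation. The learners, the `SEARCH_SPACE` dictionary, and `toy_automl` are all hypothetical names invented for illustration, and the search is a plain exhaustive grid.

```python
def fit_majority(train, params):
    """Always predict the most frequent training label."""
    labels = [y for _, y in train]
    top = max(sorted(set(labels)), key=labels.count)
    return lambda x: top

def fit_threshold(train, params):
    """Predict 1 when the single feature reaches params['threshold']."""
    return lambda x: int(x >= params["threshold"])

# Hypothetical search space: learner name -> (fit function, hyperparameter grid).
SEARCH_SPACE = {
    "majority": (fit_majority, [{}]),
    "threshold": (fit_threshold, [{"threshold": t} for t in (2, 5, 8)]),
}

def accuracy(model, data):
    """Fraction of (x, y) pairs the model predicts correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

def toy_automl(train, valid):
    """Try every learner and setting; return (score, name, params) of the best."""
    best = None
    for name, (fit, grid) in SEARCH_SPACE.items():
        for params in grid:
            score = accuracy(fit(train, params), valid)
            if best is None or score > best[0]:
                best = (score, name, params)
    return best

# Toy data: one feature, label is 1 when the feature is at least 5.
# The same data is reused for validation only to keep the sketch short.
data = [(x, int(x >= 5)) for x in range(10)]
best_score, best_name, best_params = toy_automl(data, data)
print(best_name, best_params, best_score)
```

A real AutoML library replaces the exhaustive grid with a budgeted search strategy, which is why its own settings (time budget, metric, learner list) still need tuning.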

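### Overview
- 1. Setting
- 2. Data Search
- 3. Readjusting feature dtypes
- 4. Removing whitespace from feature names
- 5. Optimizing DataFrame memory
- 6. AutoML Modeling (1)
   - Performance test with the default AutoML settings
- 7. AutoML Modeling (2)
   - Performance test after converting all values in the training data to the categorical values of the original data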

## 1. Setting

In [145]:
import numpy as np
import pandas as pd

submission_df = pd.read_csv("./data/sample_submission.csv")
train_df = pd.read_csv("./data/train.csv")
test_df = pd.read_csv("./data/test.csv")

submission_df.shape, train_df.shape, test_df.shape

((51012, 2), (76518, 38), (51012, 37))

## 2. Data Search

In [131]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76518 entries, 0 to 76517
Data columns (total 38 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   id                                              76518 non-null  int64  
 1   Marital status                                  76518 non-null  int64  
 2   Application mode                                76518 non-null  int64  
 3   Application order                               76518 non-null  int64  
 4   Course                                          76518 non-null  int64  
 5   Daytime/evening attendance                      76518 non-null  int64  
 6   Previous qualification                          76518 non-null  int64  
 7   Previous qualification (grade)                  76518 non-null  float64
 8   Nacionality                                     76518 non-null  int64  
 9   Mother's qualification                 

In [132]:
train_df.nunique()

id                                                76518
Marital status                                        6
Application mode                                     22
Application order                                     8
Course                                               19
Daytime/evening attendance                            2
Previous qualification                               21
Previous qualification (grade)                      110
Nacionality                                          18
Mother's qualification                               35
Father's qualification                               39
Mother's occupation                                  40
Father's occupation                                  56
Admission grade                                     668
Displaced                                             2
Educational special needs                             2
Debtor                                                2
Tuition fees up to date                         

In [133]:
test_df.nunique()

id                                                51012
Marital status                                        6
Application mode                                     20
Application order                                     8
Course                                               21
Daytime/evening attendance                            2
Previous qualification                               20
Previous qualification (grade)                      108
Nacionality                                          18
Mother's qualification                               32
Father's qualification                               36
Mother's occupation                                  38
Father's occupation                                  49
Admission grade                                     653
Displaced                                             2
Educational special needs                             2
Debtor                                                2
Tuition fees up to date                         

In [134]:
round(train_df["Target"].value_counts() * 100 / len(train_df), 2)

Target
Graduate    47.42
Dropout     33.06
Enrolled    19.52
Name: count, dtype: float64
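As a side note, the same class percentages can be computed without dividing by `len()` by passing `normalize=True` to `value_counts`. The sketch below uses a four-row toy series as a stand-in for `train_df["Target"]`; the variable names are illustrative only.

```python
import pandas as pd

# Toy stand-in for train_df["Target"]; the real column has 76,518 rows.
target = pd.Series(["Graduate", "Graduate", "Dropout", "Enrolled"], name="Target")

# normalize=True returns fractions, so only the conversion to percent remains.
class_share = target.value_counts(normalize=True).mul(100).round(2)
print(class_share)
```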

In [135]:
train_df.isnull().sum()

id                                                0
Marital status                                    0
Application mode                                  0
Application order                                 0
Course                                            0
Daytime/evening attendance                        0
Previous qualification                            0
Previous qualification (grade)                    0
Nacionality                                       0
Mother's qualification                            0
Father's qualification                            0
Mother's occupation                               0
Father's occupation                               0
Admission grade                                   0
Displaced                                         0
Educational special needs                         0
Debtor                                            0
Tuition fees up to date                           0
Gender                                            0
Scholarship 

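## 3. Readjusting feature dtypes
- Changing the dtype of the numeric and categorical features
   - The earlier modeling tests showed that model performance differs depending on which features are treated as numeric and which as categorical.
   - Based on this, instead of assigning a dtype to each feature the way I did before, I decided each one after checking every unique value of every feature.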
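Once every feature has been assigned to one of the two lists, the decision can be applied in a single `astype` call. The following is a minimal sketch on a toy DataFrame, assuming the lists are used this way later. The choice of `float32` for numeric features is my assumption and is not taken from this notebook.

```python
import pandas as pd

# Toy stand-in for train_df with one categorical and one numeric feature.
toy_df = pd.DataFrame({
    "Marital status": [1, 2, 1, 4],
    "Admission grade": [95.0, 120.5, 133.1, 101.4],
})
categorical_features = ["Marital status"]
numeric_features = ["Admission grade"]

# Build one {column: dtype} mapping and apply it in a single call.
dtype_map = {col: "category" for col in categorical_features}
dtype_map.update({col: "float32" for col in numeric_features})
toy_df = toy_df.astype(dtype_map)

print(toy_df.dtypes)
```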

In [7]:
categorical_features = []
numeric_features = []

### Feature descriptions from the original data

In [190]:
from ucimlrepo import fetch_ucirepo

student_data = fetch_ucirepo(id=697)
feature_info = student_data.variables
feature_info.head()

|  | name | role | type | demographic | description | units | missing_values |
|---|---|---|---|---|---|---|---|
| 0 | Marital Status | Feature | Integer | Marital Status | 1 – single 2 – married 3 – widower 4 – divorce... |  | no |
| 1 | Application mode | Feature | Integer |  | 1 - 1st phase - general contingent 2 - Ordinan... |  | no |
| 2 | Application order | Feature | Integer |  | Application order (between 0 - first choice; a... |  | no |
| 3 | Course | Feature | Integer |  | 33 - Biofuel Production Technologies 171 - Ani... |  | no |
| 4 | Daytime/evening attendance | Feature | Integer |  | 1 – daytime 0 - evening |  | no |


In [191]:
feature_info.loc[0, "name"] = 'Marital status'
feature_info.loc[0, "name"]

'Marital status'

In [192]:
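def get_feature_desc(col):
    '''
    Return the description of a feature's unique values from the UCI package metadata.
    '''
    # feature_info is the variables table loaded with fetch_ucirepo above.
    desc = feature_info.query("name == @col")["description"].values[0]

    return desc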

In [193]:
get_feature_desc("Marital status")

'1 – single 2 – married 3 – widower 4 – divorced 5 – facto union 6 – legally separated'
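Because the description is one flat string, it can help to split it into a code-to-label mapping, for example when codes are later replaced with the original categorical values. The sketch below is a heuristic: `parse_code_labels` is a hypothetical helper, not part of `ucimlrepo`, and the regular expression assumes each entry looks like `<code> <dash> <label>`. Free-text descriptions such as the one for `Application order` are not handled.

```python
import re

# Heuristic pattern: "<code> <dash> <label>", where a label runs until the next
# "<code> <dash> " or the end of the string. Both "–" and "-" occur in the metadata.
CODE_LABEL = re.compile(r"(\d+)\s*[–-]\s*(.+?)(?=\s+\d+\s*[–-]\s|$)")

def parse_code_labels(desc):
    """Hypothetical helper: turn a description string into {code: label}."""
    return {int(code): label.strip() for code, label in CODE_LABEL.findall(desc)}

marital_desc = "1 – single 2 – married 3 – widower 4 – divorced 5 – facto union 6 – legally separated"
marital_labels = parse_code_labels(marital_desc)
print(marital_labels)
```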

In [14]:
diff_test_train_uniques = {}

def train_test_unique_desc(col):
    '''
    Print the unique values of a feature in train and test together with the
    original data description, and record values that appear only in test.
    '''
    # Values present in test_df but absent from train_df.
    diff_test_train = list(set(test_df[col].unique()).difference(train_df[col].unique()))

    print(f'''
    ~~~ feature's unique and desc ~~~
    * train df's unique : {np.sort(train_df[col].unique())}
    * test df's unique : {np.sort(test_df[col].unique())}
    * compare unique (test - train) : {diff_test_train}
    * origin data desc :
       {get_feature_desc(col)}
    ''')

    # The dict is only mutated, never rebound, so no global statement is needed.
    if len(diff_test_train) > 0:
        diff_test_train_uniques[col] = diff_test_train

    return diff_test_train_uniques
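The per-feature calls below are useful for reading each description, but the test-only values themselves can also be collected for every column in one pass. This is a minimal sketch on toy frames. `find_test_only_values` is a hypothetical helper and the toy values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df.
toy_train = pd.DataFrame({"Application mode": [1, 2, 17], "Gender": [0, 1, 1]})
toy_test = pd.DataFrame({"Application mode": [1, 14, 19], "Gender": [1, 0, 0]})

def find_test_only_values(train, test):
    """Return {column: sorted values seen in test but not in train}."""
    result = {}
    for col in test.columns:
        if col not in train.columns:
            continue
        # tolist() converts NumPy scalars to plain Python numbers.
        unseen = sorted(set(test[col].tolist()) - set(train[col].tolist()))
        if unseen:
            result[col] = unseen
    return result

unseen_by_column = find_test_only_values(toy_train, toy_test)
print(unseen_by_column)
```

Sorting the result keeps the output stable between runs, whereas the order of a list built directly from a set is not guaranteed.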

### Marital status feature
- unique data: integers from 1 to 6, but they are categorical data encoded as numbers

In [15]:
diff_uniques = train_test_unique_desc("Marital status")
diff_uniques


    ~~~ feature's unique and desc ~~~
    * train df's unique : [1 2 3 4 5 6]
    * test df's unique : [1 2 3 4 5 6]
    * compare unique (test - train) : []
    * origin data desc :
       1 – single 2 – married 3 – widower 4 – divorced 5 – facto union 6 – legally separated
    


{}

In [16]:
categorical_features.append("Marital status")
categorical_features

['Marital status']

### Application mode feature

In [19]:
train_test_unique_desc("Application mode")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 1  2  3  4  5  7  9 10 12 15 16 17 18 26 27 35 39 42 43 44 51 53]
    * test df's unique : [ 1  2  3  5  7 10 14 15 16 17 18 19 27 35 39 42 43 44 51 53]
    * compare unique (test - train) : [19, 14]
    * origin data desc :
       1 - 1st phase - general contingent 2 - Ordinance No. 612/93 5 - 1st phase - special contingent (Azores Island) 7 - Holders of other higher courses 10 - Ordinance No. 854-B/99 15 - International student (bachelor) 16 - 1st phase - special contingent (Madeira Island) 17 - 2nd phase - general contingent 18 - 3rd phase - general contingent 26 - Ordinance No. 533-A/99, item b2) (Different Plan) 27 - Ordinance No. 533-A/99, item b3 (Other Institution) 39 - Over 23 years old 42 - Transfer 43 - Change of course 44 - Technological specialization diploma holders 51 - Change of institution/course 53 - Short cycle diploma holders 57 - Change of institution/course (International)
    


{'Application mode': [19, 14]}

In [20]:
diff_test_train_uniques

{'Application mode': [19, 14]}

In [21]:
categorical_features.append("Application mode")
categorical_features

['Marital status', 'Application mode']
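Test-only codes such as 14 and 19 matter once the column becomes a `category` dtype: if the categories are taken from train alone, those test values turn into missing values. One possible way to avoid this, shown here as a sketch rather than as the approach used in this notebook, is to build a shared `pd.CategoricalDtype` from the union of both sets. The toy series and variable names are illustrative.

```python
import pandas as pd

# Toy stand-ins for the "Application mode" column in train and test.
train_col = pd.Series([1, 17, 39], name="Application mode")
test_col = pd.Series([1, 14, 19], name="Application mode")

# Use the union of observed codes so both frames share identical categories.
all_codes = sorted(set(train_col.tolist()) | set(test_col.tolist()))
shared_dtype = pd.CategoricalDtype(categories=all_codes)

train_cat = train_col.astype(shared_dtype)
test_cat = test_col.astype(shared_dtype)
print(test_cat)
```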

### Application order feature

In [22]:
train_test_unique_desc("Application order")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [0 1 2 3 4 5 6 9]
    * test df's unique : [0 1 2 3 4 5 6 9]
    * compare unique (test - train) : []
    * origin data desc :
       Application order (between 0 - first choice; and 9 last choice)
    


{'Application mode': [19, 14]}

In [23]:
diff_test_train_uniques

{'Application mode': [19, 14]}

In [24]:
categorical_features.append("Application order")
categorical_features

['Marital status', 'Application mode', 'Application order']

### Course feature

In [25]:
train_test_unique_desc("Course")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [  33   39  171  979 8014 9003 9070 9085 9119 9130 9147 9238 9254 9500
 9556 9670 9773 9853 9991]
    * test df's unique : [  33  171 2105 4147 7500 8014 9003 9070 9085 9119 9130 9147 9238 9254
 9257 9500 9556 9670 9773 9853 9991]
    * compare unique (test - train) : [9257, 7500, 4147, 2105]
    * origin data desc :
       33 - Biofuel Production Technologies 171 - Animation and Multimedia Design 8014 - Social Service (evening attendance) 9003 - Agronomy 9070 - Communication Design 9085 - Veterinary Nursing 9119 - Informatics Engineering 9130 - Equinculture 9147 - Management 9238 - Social Service 9254 - Tourism 9500 - Nursing 9556 - Oral Hygiene 9670 - Advertising and Marketing Management 9773 - Journalism and Communication 9853 - Basic Education 9991 - Management (evening attendance)
    


{'Application mode': [19, 14], 'Course': [9257, 7500, 4147, 2105]}

In [26]:
diff_test_train_uniques

{'Application mode': [19, 14], 'Course': [9257, 7500, 4147, 2105]}

In [27]:
categorical_features.append("Course")
categorical_features

['Marital status', 'Application mode', 'Application order', 'Course']

### Daytime/evening attendance feature

In [28]:
train_test_unique_desc("Daytime/evening attendance")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [0 1]
    * test df's unique : [0 1]
    * compare unique (test - train) : []
    * origin data desc :
       1 – daytime 0 - evening
    


{'Application mode': [19, 14], 'Course': [9257, 7500, 4147, 2105]}

In [29]:
diff_test_train_uniques

{'Application mode': [19, 14], 'Course': [9257, 7500, 4147, 2105]}

In [30]:
categorical_features.append("Daytime/evening attendance")
categorical_features

['Marital status',
 'Application mode',
 'Application order',
 'Course',
 'Daytime/evening attendance']

### Previous qualification feature

In [31]:
train_test_unique_desc("Previous qualification")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 1  2  3  4  5  6  9 10 11 12 14 15 17 19 36 37 38 39 40 42 43]
    * test df's unique : [ 1  2  3  4  5  6  9 10 11 12 14 15 16 17 19 38 39 40 42 43]
    * compare unique (test - train) : [16]
    * origin data desc :
       1 - Secondary education 2 - Higher education - bachelor's degree 3 - Higher education - degree 4 - Higher education - master's 5 - Higher education - doctorate 6 - Frequency of higher education 9 - 12th year of schooling - not completed 10 - 11th year of schooling - not completed 12 - Other - 11th year of schooling 14 - 10th year of schooling 15 - 10th year of schooling - not completed 19 - Basic education 3rd cycle (9th/10th/11th year) or equiv. 38 - Basic education 2nd cycle (6th/7th/8th year) or equiv. 39 - Technological specialization course 40 - Higher education - degree (1st cycle) 42 - Professional higher technical course 43 - Higher education - master (2nd cycle)
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16]}

In [32]:
categorical_features.append("Previous qualification")
categorical_features

['Marital status',
 'Application mode',
 'Application order',
 'Course',
 'Daytime/evening attendance',
 'Previous qualification']

### Previous qualification (grade) feature

In [33]:
train_test_unique_desc("Previous qualification (grade)")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 95.   96.   97.   99.  100.  101.  102.  103.  105.  106.  107.  108.
 109.  110.  111.  112.  113.  114.  115.  116.  117.  118.  118.4 118.9
 119.  119.1 119.3 119.5 120.  121.  122.  123.  123.9 124.  124.4 125.
 126.  126.6 126.8 127.  128.  129.  130.  131.  132.  133.  133.1 134.
 135.  136.  137.  138.  138.4 138.6 138.7 139.  139.3 140.  140.6 140.8
 141.  142.  143.  144.  145.  145.7 145.9 146.  147.  148.  148.8 148.9
 149.  150.  151.  152.  153.  154.  154.9 155.  156.  157.  158.  159.
 160.  161.  162.  162.9 163.  164.  165.  166.  167.  168.  169.  170.
 171.  172.  173.  174.  175.  176.  177.  178.  180.  182.  184.  184.4
 188.  190. ]
    * test df's unique : [ 95.   96.   97.   98.   99.  100.  100.1 101.  102.  102.4 103.  105.
 106.  107.  108.  109.  110.  111.  112.  113.  114.  115.  116.  117.
 117.4 118.  118.4 119.  119.1 119.4 120.  121.  122.  123.  123.4 124.
 124.4 125.  126.  126.6 127

{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4]}

In [34]:
diff_test_train_uniques

{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4]}

In [35]:
numeric_features.append("Previous qualification (grade)")
numeric_features

['Previous qualification (grade)']

### Nacionality feature

In [36]:
train_test_unique_desc("Nacionality")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [  1   2   6  11  17  21  22  24  25  26  32  41  62 100 101 103 105 109]
    * test df's unique : [  1   2   6  11  14  21  22  24  25  26  32  41  62 100 101 103 105 109]
    * compare unique (test - train) : [14]
    * origin data desc :
       1 - Portuguese; 2 - German; 6 - Spanish; 11 - Italian; 13 - Dutch; 14 - English; 17 - Lithuanian; 21 - Angolan; 22 - Cape Verdean; 24 - Guinean; 25 - Mozambican; 26 - Santomean; 32 - Turkish; 41 - Brazilian; 62 - Romanian; 100 - Moldova (Republic of); 101 - Mexican; 103 - Ukrainian; 105 - Russian; 108 - Cuban; 109 - Colombian
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14]}

In [37]:
categorical_features.append("Nacionality")
categorical_features

['Marital status',
 'Application mode',
 'Application order',
 'Course',
 'Daytime/evening attendance',
 'Previous qualification',
 'Nacionality']

### Mother's qualification feature

In [38]:
train_test_unique_desc("Mother's qualification")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 1  2  3  4  5  6  7  8  9 10 11 12 14 15 18 19 22 26 27 28 29 30 31 33
 34 35 36 37 38 39 40 41 42 43 44]
    * test df's unique : [ 1  2  3  4  5  6  9 10 11 12 13 14 18 19 22 25 26 29 30 31 33 34 35 36
 37 38 39 40 41 42 43 44]
    * compare unique (test - train) : [13, 25]
    * origin data desc :
       1 - Secondary Education - 12th Year of Schooling or Eq. 2 - Higher Education - Bachelor's Degree 3 - Higher Education - Degree 4 - Higher Education - Master's 5 - Higher Education - Doctorate 6 - Frequency of Higher Education 9 - 12th Year of Schooling - Not Completed 10 - 11th Year of Schooling - Not Completed 11 - 7th Year (Old) 12 - Other - 11th Year of Schooling 14 - 10th Year of Schooling 18 - General commerce course 19 - Basic Education 3rd Cycle (9th/10th/11th Year) or Equiv. 22 - Technical-professional course 26 - 7th year of schooling 27 - 2nd cycle of the general high school course 29 - 9th Year of Schoolin

{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25]}

In [39]:
categorical_features.append("Mother's qualification")
categorical_features

['Marital status',
 'Application mode',
 'Application order',
 'Course',
 'Daytime/evening attendance',
 'Previous qualification',
 'Nacionality',
 "Mother's qualification"]

### Father's qualification feature

In [40]:
train_test_unique_desc("Father's qualification")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 1  2  3  4  5  6  7  9 10 11 12 13 14 15 18 19 20 21 22 23 24 25 26 27
 29 30 31 33 34 35 36 37 38 39 40 41 42 43 44]
    * test df's unique : [ 1  2  3  4  5  6  7  9 10 11 12 13 14 16 18 19 21 22 25 26 27 28 29 30
 31 33 34 35 36 37 38 39 40 41 42 43]
    * compare unique (test - train) : [16, 28]
    * origin data desc :
       1 - Secondary Education - 12th Year of Schooling or Eq. 2 - Higher Education - Bachelor's Degree 3 - Higher Education - Degree 4 - Higher Education - Master's 5 - Higher Education - Doctorate 6 - Frequency of Higher Education 9 - 12th Year of Schooling - Not Completed 10 - 11th Year of Schooling - Not Completed 11 - 7th Year (Old) 12 - Other - 11th Year of Schooling 13 - 2nd year complementary high school course 14 - 10th Year of Schooling 18 - General commerce course 19 - Basic Education 3rd Cycle (9th/10th/11th Year) or Equiv. 20 - Complementary High School Course 22 - Technical-professional

{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28]}

In [41]:
categorical_features.append("Father's qualification")
categorical_features

['Marital status',
 'Application mode',
 'Application order',
 'Course',
 'Daytime/evening attendance',
 'Previous qualification',
 'Nacionality',
 "Mother's qualification",
 "Father's qualification"]

### Mother's occupation feature

In [42]:
train_test_unique_desc("Mother's occupation")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [  0   1   2   3   4   5   6   7   8   9  10  11  38  90  99 101 103 122
 123 124 125 127 131 132 134 141 143 144 151 152 153 163 171 172 173 175
 191 192 193 194]
    * test df's unique : [  0   1   2   3   4   5   6   7   8   9  10  90  98  99 122 123 124 125
 131 132 133 134 141 143 144 151 152 153 154 171 173 174 175 181 191 192
 193 194]
    * compare unique (test - train) : [98, 133, 174, 181, 154]
    * origin data desc :
       0 - Student 1 - Representatives of the Legislative Power and Executive Bodies, Directors, Directors and Executive Managers 2 - Specialists in Intellectual and Scientific Activities 3 - Intermediate Level Technicians and Professions 4 - Administrative staff 5 - Personal Services, Security and Safety Workers and Sellers 6 - Farmers and Skilled Workers in Agriculture, Fisheries and Forestry 7 - Skilled Workers in Industry, Construction and Craftsmen 8 - Installation and Machine Operators and A

{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154]}

In [43]:
categorical_features.append("Mother's occupation")
categorical_features

['Marital status',
 'Application mode',
 'Application order',
 'Course',
 'Daytime/evening attendance',
 'Previous qualification',
 'Nacionality',
 "Mother's qualification",
 "Father's qualification",
 "Mother's occupation"]

### Father's occupation feature

In [44]:
train_test_unique_desc("Father's occupation")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  19  22  39  90
  96  99 101 102 103 112 114 121 122 123 124 125 131 132 134 135 141 143
 144 148 151 152 153 154 161 163 171 172 174 175 181 182 183 191 192 193
 194 195]
    * test df's unique : [  0   1   2   3   4   5   6   7   8   9  10  11  90  99 101 102 103 112
 113 114 120 121 122 123 124 125 131 134 135 141 143 144 151 152 153 154
 161 163 171 172 174 175 181 182 183 192 193 194 195]
    * compare unique (test - train) : [113, 120]
    * origin data desc :
       0 - Student 1 - Representatives of the Legislative Power and Executive Bodies, Directors, Directors and Executive Managers 2 - Specialists in Intellectual and Scientific Activities 3 - Intermediate Level Technicians and Professions 4 - Administrative staff 5 - Personal Services, Security and Safety Workers and Sellers 6 - Farmers and Skilled Workers in Agriculture, Fisheries and Forestry 7 - Skill

{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120]}

In [45]:
categorical_features.append("Father's occupation")
categorical_features

['Marital status',
 'Application mode',
 'Application order',
 'Course',
 'Daytime/evening attendance',
 'Previous qualification',
 'Nacionality',
 "Mother's qualification",
 "Father's qualification",
 "Mother's occupation",
 "Father's occupation"]

### Admission grade feature

In [46]:
train_test_unique_desc("Admission grade")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 95.          95.1         95.5         95.7         95.8
  96.          96.1         96.5         96.7         97.
  97.2         97.4         97.5         97.6         98.
  98.1         98.4         98.5         98.6         98.7
  98.8         98.9         99.          99.3         99.5
  99.6         99.7        100.         100.1        100.2
 100.5        100.6        100.7        100.8        100.9
 101.         101.2        101.3        101.5        101.6
 101.7        101.8        102.         102.2        102.4
 102.5        102.6        102.8        103.         103.4
 103.5        103.6        103.7        103.8        104.
 104.1        104.2        104.3        104.4        104.5
 104.6        104.7        104.8        105.         105.1
 105.2        105.3        105.4        105.5        105.6
 105.7        105.8        105.9        106.         106.1
 106.2        106.3        106.4        106.5        

{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516]}

In [47]:
numeric_features.append("Admission grade")
numeric_features

['Previous qualification (grade)', 'Admission grade']

### Displaced feature

In [48]:
train_test_unique_desc("Displaced")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [0 1]
    * test df's unique : [0 1]
    * compare unique (test - train) : []
    * origin data desc :
       1 – yes 0 – no
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516]}

In [49]:
categorical_features.append("Displaced")
categorical_features

['Marital status',
 'Application mode',
 'Application order',
 'Course',
 'Daytime/evening attendance',
 'Previous qualification',
 'Nacionality',
 "Mother's qualification",
 "Father's qualification",
 "Mother's occupation",
 "Father's occupation",
 'Displaced']

### Educational special needs feature

In [50]:
train_test_unique_desc("Educational special needs")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [0 1]
    * test df's unique : [0 1]
    * compare unique (test - train) : []
    * origin data desc :
       1 – yes 0 – no
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516]}

In [51]:
categorical_features.append("Educational special needs")
categorical_features

['Marital status',
 'Application mode',
 'Application order',
 'Course',
 'Daytime/evening attendance',
 'Previous qualification',
 'Nacionality',
 "Mother's qualification",
 "Father's qualification",
 "Mother's occupation",
 "Father's occupation",
 'Displaced',
 'Educational special needs']

### Debtor feature

In [52]:
train_test_unique_desc("Debtor")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [0 1]
    * test df's unique : [0 1]
    * compare unique (test - train) : []
    * origin data desc :
       1 – yes 0 – no
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516]}

In [53]:
categorical_features.append("Debtor")
categorical_features

['Marital status',
 'Application mode',
 'Application order',
 'Course',
 'Daytime/evening attendance',
 'Previous qualification',
 'Nacionality',
 "Mother's qualification",
 "Father's qualification",
 "Mother's occupation",
 "Father's occupation",
 'Displaced',
 'Educational special needs',
 'Debtor']

### Tuition fees up to date feature

In [54]:
train_test_unique_desc("Tuition fees up to date")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [0 1]
    * test df's unique : [0 1]
    * compare unique (test - train) : []
    * origin data desc :
       1 – yes 0 – no
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516]}

In [55]:
categorical_features.append("Tuition fees up to date")
categorical_features

['Marital status',
 'Application mode',
 'Application order',
 'Course',
 'Daytime/evening attendance',
 'Previous qualification',
 'Nacionality',
 "Mother's qualification",
 "Father's qualification",
 "Mother's occupation",
 "Father's occupation",
 'Displaced',
 'Educational special needs',
 'Debtor',
 'Tuition fees up to date']

### Gender feature

In [56]:
train_test_unique_desc("Gender")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [0 1]
    * test df's unique : [0 1]
    * compare unique (test - train) : []
    * origin data desc :
       1 – male 0 – female
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516]}

In [57]:
categorical_features.append("Gender")
categorical_features

['Marital status',
 'Application mode',
 'Application order',
 'Course',
 'Daytime/evening attendance',
 'Previous qualification',
 'Nacionality',
 "Mother's qualification",
 "Father's qualification",
 "Mother's occupation",
 "Father's occupation",
 'Displaced',
 'Educational special needs',
 'Debtor',
 'Tuition fees up to date',
 'Gender']

### Scholarship holder feature

In [58]:
train_test_unique_desc("Scholarship holder")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [0 1]
    * test df's unique : [0 1]
    * compare unique (test - train) : []
    * origin data desc :
       1 – yes 0 – no
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516]}

In [59]:
categorical_features.append("Scholarship holder")
categorical_features

['Marital status',
 'Application mode',
 'Application order',
 'Course',
 'Daytime/evening attendance',
 'Previous qualification',
 'Nacionality',
 "Mother's qualification",
 "Father's qualification",
 "Mother's occupation",
 "Father's occupation",
 'Displaced',
 'Educational special needs',
 'Debtor',
 'Tuition fees up to date',
 'Gender',
 'Scholarship holder']

### Age at enrollment feature

In [60]:
train_test_unique_desc("Age at enrollment")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 57 58 59 60 61 62 70]
    * test df's unique : [17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 57 58 59 60 61 62 70]
    * compare unique (test - train) : []
    * origin data desc :
       Age of studend at enrollment
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516]}

In [64]:
numeric_features.append("Age at enrollment")
numeric_features

['Previous qualification (grade)', 'Admission grade', 'Age at enrollment']

### International feature

In [65]:
train_test_unique_desc("International")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [0 1]
    * test df's unique : [0 1]
    * compare unique (test - train) : []
    * origin data desc :
       1 – yes 0 – no
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516]}

In [66]:
categorical_features.append("International")
categorical_features

['Marital status',
 'Application mode',
 'Application order',
 'Course',
 'Daytime/evening attendance',
 'Previous qualification',
 'Nacionality',
 "Mother's qualification",
 "Father's qualification",
 "Mother's occupation",
 "Father's occupation",
 'Displaced',
 'Educational special needs',
 'Debtor',
 'Tuition fees up to date',
 'Gender',
 'Scholarship holder',
 'International']
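Once the categorical list is complete, one possible follow-up is to cast those columns to the pandas `category` dtype. The sketch below assumes that train and test should share a single set of levels, so that values seen only in test (such as the extra `Course` codes reported above) do not become `NaN`. The toy frames and the name `toy_categorical` are hypothetical.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy frames standing in for train_df / test_df (values are illustrative only).
toy_train = pd.DataFrame({"Course": [171, 9254, 9254], "Gender": [0, 1, 1]})
toy_test = pd.DataFrame({"Course": [171, 9257, 9254], "Gender": [1, 0, 0]})
toy_categorical = ["Course", "Gender"]

for col in toy_categorical:
    # Build the level set from both frames so the dtypes are identical.
    levels = sorted(set(toy_train[col]) | set(toy_test[col]))
    shared = CategoricalDtype(categories=levels)
    toy_train[col] = toy_train[col].astype(shared)
    toy_test[col] = toy_test[col].astype(shared)

print(toy_test["Course"].cat.categories.tolist())
```

Whether sharing levels across train and test is appropriate depends on the model; it is shown here only as one option.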

### Curricular units 1st sem (credited) feature

In [67]:
train_test_unique_desc("Curricular units 1st sem (credited)")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
    * test df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
    * compare unique (test - train) : []
    * origin data desc :
       Number of curricular units credited in the 1st semester
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516]}

In [68]:
numeric_features.append("Curricular units 1st sem (credited)")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)']

### Curricular units 1st sem (enrolled) feature

In [69]:
train_test_unique_desc("Curricular units 1st sem (enrolled)")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 21 22 23 26]
    * test df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 23]
    * compare unique (test - train) : [20]
    * origin data desc :
       Number of curricular units enrolled in the 1st semester
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516],
 'Curricular units 1st sem (enrolled)': [20]}
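A value that appears only in test, like `20` here, matters more if many rows carry it. A minimal sketch for counting the affected test rows follows. The function name `unseen_row_share` and the toy data are assumptions made for illustration.

```python
import pandas as pd


def unseen_row_share(train: pd.DataFrame, test: pd.DataFrame, column: str):
    """Count test rows whose `column` value never occurs in train."""
    mask = ~test[column].isin(train[column].unique())
    return int(mask.sum()), float(mask.mean())


# Toy frames standing in for train_df / test_df (values are illustrative only).
toy_train = pd.DataFrame({"enrolled": [5, 6, 6, 7, 21]})
toy_test = pd.DataFrame({"enrolled": [5, 6, 20, 7]})

count, share = unseen_row_share(toy_train, toy_test, "enrolled")
print(count, share)
```

For a numeric count feature, a handful of unseen values is usually harmless, since the model can still order them relative to the seen ones.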

In [70]:
numeric_features.append("Curricular units 1st sem (enrolled)")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)']

### Curricular units 1st sem (evaluations) feature

In [71]:
train_test_unique_desc("Curricular units 1st sem (evaluations)")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 31 32 33 35 36 45]
    * test df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 26 27 28 29 32 33 36 37 45]
    * compare unique (test - train) : [37]
    * origin data desc :
       Number of evaluations to curricular units in the 1st semester
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516],
 'Curricular units 1st sem (enrolled)': [20],
 'Curricular units 1st sem (evaluations)': [37]}

In [72]:
numeric_features.append("Curricular units 1st sem (evaluations)")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)']

### Curricular units 1st sem (approved) feature

In [73]:
train_test_unique_desc("Curricular units 1st sem (approved)")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 26]
    * test df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 23]
    * compare unique (test - train) : [23]
    * origin data desc :
       Number of curricular units approved in the 1st semester
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516],
 'Curricular units 1st sem (enrolled)': [20],
 'Curricular units 1st sem (evaluations)': [37],
 'Curricular units 1st sem (approved)': [23]}

In [74]:
numeric_features.append("Curricular units 1st sem (approved)")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)',
 'Curricular units 1st sem (approved)']

### Curricular units 1st sem (grade) feature

In [75]:
train_test_unique_desc("Curricular units 1st sem (grade)")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 0.          1.          5.66666667 ... 17.33333333 18.
 18.875     ]
    * test df's unique : [ 0.          1.          2.         ... 17.69230769 18.
 18.875     ]
    * compare unique (test - train) : [2.0, 6.333333333333334, 8.2, 8.714285714285714, 16.875, 13.94, 14.777777777778, 15.777777777778, 11.5625, 12.900375, 13.711111111111112, 14.193333333333332, 15.137692307692308, 14.0625, 13.125714285714285, 12.927272727272726, 13.927272727272726, 12.2727272727272, 15.566666666666668, 13.982857142857142, 16.285714285714285, 14.1444444444444, 11.26, 13.094444444444443, 14.035714285714286, 13.392857142857144, 12.507142857142856, 10.7, 11.98125, 11.2625, 13.23125, 14.10625, 15.3875, 17.0055555555556, 12.98857142857143, 12.91111111111111, 14.477777777778, 14.7555555555556, 14.1665, 12.1818181818182, 15.016666666666666, 16.692307692307693, 17.692307692307693, 13.1818181818182, 11.847142857142858, 12.48625, 13.86125, 13.2666666

{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516],
 'Curricular units 1st sem (enrolled)': [20],
 'Curricular units 1st sem (evaluations)': [37],
 'Curricular units 1st sem (approved)': [23],
 'Curricular units 1st sem (grade)': [2.0,
  6.333333333333334,
  8.2,
  8.714285714285714,
  16.875,
  13.94,
  14.777777777778,
  15.777777777778,
  11.5625,
  12.900375,
  13.711111111111112,
  14.193333333333332,
  15.137

In [76]:
numeric_features.append("Curricular units 1st sem (grade)")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)',
 'Curricular units 1st sem (approved)',
 'Curricular units 1st sem (grade)']
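For a continuous column such as the semester grade, a long list of test-only values is expected. Some entries, for example `14.777777777778`, look like shortened forms of repeating decimals, so part of the mismatch may come from float precision rather than from genuinely new grades. This is an assumption, not something the notebook verifies. A hedged sketch that rounds both sides before comparing:

```python
import numpy as np


def unseen_after_rounding(train_values, test_values, decimals=3):
    """Compare unique values after rounding both sides to `decimals` places."""
    train_rounded = np.unique(np.round(np.asarray(train_values, dtype=float), decimals))
    test_rounded = np.unique(np.round(np.asarray(test_values, dtype=float), decimals))
    return np.setdiff1d(test_rounded, train_rounded).tolist()


# Toy grades (illustrative only): the first pair differs only in precision.
toy_train_grades = [14.777777777777779, 12.5, 13.0]
toy_test_grades = [14.777777777778, 12.5, 16.875]

raw_unseen = sorted(set(toy_test_grades) - set(toy_train_grades))
rounded_unseen = unseen_after_rounding(toy_train_grades, toy_test_grades)
print(raw_unseen, rounded_unseen)
```

The choice of three decimal places is arbitrary; any tolerance coarser than the precision noise would serve the same purpose.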

### Curricular units 1st sem (without evaluations) feature

In [77]:
train_test_unique_desc("Curricular units 1st sem (without evaluations)")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 12]
    * test df's unique : [ 0  1  2  3  4  5  6  7  8 10 12]
    * compare unique (test - train) : []
    * origin data desc :
       Number of curricular units without evalutions in the 1st semester
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516],
 'Curricular units 1st sem (enrolled)': [20],
 'Curricular units 1st sem (evaluations)': [37],
 'Curricular units 1st sem (approved)': [23],
 'Curricular units 1st sem (grade)': [2.0,
  6.333333333333334,
  8.2,
  8.714285714285714,
  16.875,
  13.94,
  14.777777777778,
  15.777777777778,
  11.5625,
  12.900375,
  13.711111111111112,
  14.193333333333332,
  15.137

In [78]:
numeric_features.append("Curricular units 1st sem (without evaluations)")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)',
 'Curricular units 1st sem (approved)',
 'Curricular units 1st sem (grade)',
 'Curricular units 1st sem (without evaluations)']

### Curricular units 2nd sem (credited) feature

In [79]:
train_test_unique_desc("Curricular units 2nd sem (credited)")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]
    * test df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 18 19]
    * compare unique (test - train) : []
    * origin data desc :
       Number of curricular units credited in the 2nd semester
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516],
 'Curricular units 1st sem (enrolled)': [20],
 'Curricular units 1st sem (evaluations)': [37],
 'Curricular units 1st sem (approved)': [23],
 'Curricular units 1st sem (grade)': [2.0,
  6.333333333333334,
  8.2,
  8.714285714285714,
  16.875,
  13.94,
  14.777777777778,
  15.777777777778,
  11.5625,
  12.900375,
  13.711111111111112,
  14.193333333333332,
  15.137

In [80]:
numeric_features.append("Curricular units 2nd sem (credited)")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)',
 'Curricular units 1st sem (approved)',
 'Curricular units 1st sem (grade)',
 'Curricular units 1st sem (without evaluations)',
 'Curricular units 2nd sem (credited)']

### Curricular units 2nd sem (enrolled) feature

In [81]:
train_test_unique_desc("Curricular units 2nd sem (enrolled)")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 21 23]
    * test df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 21 22 23]
    * compare unique (test - train) : [22]
    * origin data desc :
       Number of curricular units enrolled in the 2nd semester
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516],
 'Curricular units 1st sem (enrolled)': [20],
 'Curricular units 1st sem (evaluations)': [37],
 'Curricular units 1st sem (approved)': [23],
 'Curricular units 1st sem (grade)': [2.0,
  6.333333333333334,
  8.2,
  8.714285714285714,
  16.875,
  13.94,
  14.777777777778,
  15.777777777778,
  11.5625,
  12.900375,
  13.711111111111112,
  14.193333333333332,
  15.137

In [82]:
numeric_features.append("Curricular units 2nd sem (enrolled)")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)',
 'Curricular units 1st sem (approved)',
 'Curricular units 1st sem (grade)',
 'Curricular units 1st sem (without evaluations)',
 'Curricular units 2nd sem (credited)',
 'Curricular units 2nd sem (enrolled)']

### Curricular units 2nd sem (evaluations) feature

In [83]:
train_test_unique_desc("Curricular units 2nd sem (evaluations)")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 31 33]
    * test df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 25 26 27 28 29 33]
    * compare unique (test - train) : [29]
    * origin data desc :
       Number of evaluations to curricular units in the 2nd semester
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516],
 'Curricular units 1st sem (enrolled)': [20],
 'Curricular units 1st sem (evaluations)': [37],
 'Curricular units 1st sem (approved)': [23],
 'Curricular units 1st sem (grade)': [2.0,
  6.333333333333334,
  8.2,
  8.714285714285714,
  16.875,
  13.94,
  14.777777777778,
  15.777777777778,
  11.5625,
  12.900375,
  13.711111111111112,
  14.193333333333332,
  15.137

In [84]:
numeric_features.append("Curricular units 2nd sem (evaluations)")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)',
 'Curricular units 1st sem (approved)',
 'Curricular units 1st sem (grade)',
 'Curricular units 1st sem (without evaluations)',
 'Curricular units 2nd sem (credited)',
 'Curricular units 2nd sem (enrolled)',
 'Curricular units 2nd sem (evaluations)']

### Curricular units 2nd sem (approved) feature

In [85]:
train_test_unique_desc("Curricular units 2nd sem (approved)")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
    * test df's unique : [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
    * compare unique (test - train) : []
    * origin data desc :
       Number of curricular units approved in the 2nd semester
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516],
 'Curricular units 1st sem (enrolled)': [20],
 'Curricular units 1st sem (evaluations)': [37],
 'Curricular units 1st sem (approved)': [23],
 'Curricular units 1st sem (grade)': [2.0,
  6.333333333333334,
  8.2,
  8.714285714285714,
  16.875,
  13.94,
  14.777777777778,
  15.777777777778,
  11.5625,
  12.900375,
  13.711111111111112,
  14.193333333333332,
  15.137

In [86]:
numeric_features.append("Curricular units 2nd sem (approved)")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)',
 'Curricular units 1st sem (approved)',
 'Curricular units 1st sem (grade)',
 'Curricular units 1st sem (without evaluations)',
 'Curricular units 2nd sem (credited)',
 'Curricular units 2nd sem (enrolled)',
 'Curricular units 2nd sem (evaluations)',
 'Curricular units 2nd sem (approved)']

### Curricular units 2nd sem (grade) feature

In [87]:
train_test_unique_desc("Curricular units 2nd sem (grade)")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 0.          1.          2.2        ... 17.69230769 17.71428571
 18.        ]
    * test df's unique : [ 0.          1.          7.66666667 ... 17.5875     17.6
 17.71428571]
    * compare unique (test - train) : [8.25, 17.5, 11.777777777778, 13.198571428571428, 14.777777777778, 12.3125, 11.193333333333332, 12.1875, 13.1875, 10.818181818181818, 12.566666666666668, 13.566666666666668, 15.566666666666668, 15.1277777777778, 14.7727272727272, 15.828571428571427, 13.862857142857145, 14.2575, 16.46153846153846, 14.035714285714286, 11.757142857142856, 10.85625, 10.575, 11.6375, 13.075, 14.6375, 14.91875, 15.1375, 13.188, 14.441428571428572, 15.725714285714288, 13.114285714285714, 13.0555555555555, 13.5555555555555, 14.0555555555555, 15.5555555555555, 15.016666666666666, 12.918571428571427, 12.48625, 12.266666666666667, 13.86125, 14.812857142857142, 12.7125, 13.80625, 13.341666666666669, 14.937142857142858, 10.923076923076923, 1

{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516],
 'Curricular units 1st sem (enrolled)': [20],
 'Curricular units 1st sem (evaluations)': [37],
 'Curricular units 1st sem (approved)': [23],
 'Curricular units 1st sem (grade)': [2.0,
  6.333333333333334,
  8.2,
  8.714285714285714,
  16.875,
  13.94,
  14.777777777778,
  15.777777777778,
  11.5625,
  12.900375,
  13.711111111111112,
  14.193333333333332,
  15.137

In [88]:
numeric_features.append("Curricular units 2nd sem (grade)")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)',
 'Curricular units 1st sem (approved)',
 'Curricular units 1st sem (grade)',
 'Curricular units 1st sem (without evaluations)',
 'Curricular units 2nd sem (credited)',
 'Curricular units 2nd sem (enrolled)',
 'Curricular units 2nd sem (evaluations)',
 'Curricular units 2nd sem (approved)',
 'Curricular units 2nd sem (grade)']

### Curricular units 2nd sem (without evaluations) feature

In [89]:
train_test_unique_desc("Curricular units 2nd sem (without evaluations)")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 0  1  2  3  4  5  6  7  8 10 12]
    * test df's unique : [ 0  1  2  3  4  5  6  7  8 10]
    * compare unique (test - train) : []
    * origin data desc :
       Number of curricular units without evalutions in the 1st semester
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516],
 'Curricular units 1st sem (enrolled)': [20],
 'Curricular units 1st sem (evaluations)': [37],
 'Curricular units 1st sem (approved)': [23],
 'Curricular units 1st sem (grade)': [2.0,
  6.333333333333334,
  8.2,
  8.714285714285714,
  16.875,
  13.94,
  14.777777777778,
  15.777777777778,
  11.5625,
  12.900375,
  13.711111111111112,
  14.193333333333332,
  15.137

In [90]:
numeric_features.append("Curricular units 2nd sem (without evaluations)")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)',
 'Curricular units 1st sem (approved)',
 'Curricular units 1st sem (grade)',
 'Curricular units 1st sem (without evaluations)',
 'Curricular units 2nd sem (credited)',
 'Curricular units 2nd sem (enrolled)',
 'Curricular units 2nd sem (evaluations)',
 'Curricular units 2nd sem (approved)',
 'Curricular units 2nd sem (grade)',
 'Curricular units 2nd sem (without evaluations)']

### Unemployment rate feature

In [91]:
train_test_unique_desc("Unemployment rate")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [ 7.6  8.9  9.4 10.8 11.1 12.4 12.7 13.9 14.5 15.5 16.2]
    * test df's unique : [ 7.6         8.9         9.4        10.8        10.83333333 11.1
 12.4        12.7        13.9        14.9        15.5        16.2       ]
    * compare unique (test - train) : [10.833333333333334, 14.9]
    * origin data desc :
       Unemployment rate (%)
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516],
 'Curricular units 1st sem (enrolled)': [20],
 'Curricular units 1st sem (evaluations)': [37],
 'Curricular units 1st sem (approved)': [23],
 'Curricular units 1st sem (grade)': [2.0,
  6.333333333333334,
  8.2,
  8.714285714285714,
  16.875,
  13.94,
  14.777777777778,
  15.777777777778,
  11.5625,
  12.900375,
  13.711111111111112,
  14.193333333333332,
  15.137

In [92]:
numeric_features.append("Unemployment rate")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)',
 'Curricular units 1st sem (approved)',
 'Curricular units 1st sem (grade)',
 'Curricular units 1st sem (without evaluations)',
 'Curricular units 2nd sem (credited)',
 'Curricular units 2nd sem (enrolled)',
 'Curricular units 2nd sem (evaluations)',
 'Curricular units 2nd sem (approved)',
 'Curricular units 2nd sem (grade)',
 'Curricular units 2nd sem (without evaluations)',
 'Unemployment rate']

### Inflation rate feature

In [93]:
train_test_unique_desc("Inflation rate")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [-0.8 -0.6 -0.3  0.3  0.4  0.5  0.6  0.7  1.4  2.5  2.6  2.8  3.7]
    * test df's unique : [-0.8 -0.6 -0.3  0.   0.3  0.5  0.6  0.8  1.4  2.6  2.8  3.7]
    * compare unique (test - train) : [0.8, 0.0]
    * origin data desc :
       Inflation rate (%)
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516],
 'Curricular units 1st sem (enrolled)': [20],
 'Curricular units 1st sem (evaluations)': [37],
 'Curricular units 1st sem (approved)': [23],
 'Curricular units 1st sem (grade)': [2.0,
  6.333333333333334,
  8.2,
  8.714285714285714,
  16.875,
  13.94,
  14.777777777778,
  15.777777777778,
  11.5625,
  12.900375,
  13.711111111111112,
  14.193333333333332,
  15.137

In [94]:
numeric_features.append("Inflation rate")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)',
 'Curricular units 1st sem (approved)',
 'Curricular units 1st sem (grade)',
 'Curricular units 1st sem (without evaluations)',
 'Curricular units 2nd sem (credited)',
 'Curricular units 2nd sem (enrolled)',
 'Curricular units 2nd sem (evaluations)',
 'Curricular units 2nd sem (approved)',
 'Curricular units 2nd sem (grade)',
 'Curricular units 2nd sem (without evaluations)',
 'Unemployment rate',
 'Inflation rate']

### GDP feature

In [95]:
train_test_unique_desc("GDP")


    ~~~ feature's unique and desc ~~~
    * train df's unique : [-4.06 -3.12 -1.7  -0.92  0.32  0.74  0.79  1.74  1.79  2.02  3.51]
    * test df's unique : [-4.06 -3.12 -1.7  -0.92  0.32  0.79  1.74  1.79  2.02  3.51]
    * compare unique (test - train) : []
    * origin data desc :
       GDP
    


{'Application mode': [19, 14],
 'Course': [9257, 7500, 4147, 2105],
 'Previous qualification': [16],
 'Previous qualification (grade)': [133.8,
  153.9,
  154.4,
  154.6,
  156.9,
  98.0,
  163.4,
  163.3,
  100.1,
  102.4,
  169.7,
  117.4,
  119.4,
  123.4],
 'Nacionality': [14],
 "Mother's qualification": [13, 25],
 "Father's qualification": [16, 28],
 "Mother's occupation": [98, 133, 174, 181, 154],
 "Father's occupation": [113, 120],
 'Admission grade': [148.1,
  151.8,
  153.4,
  155.79,
  161.6,
  162.8,
  165.6,
  168.1,
  171.7,
  95.2,
  97.7,
  97.8,
  98.3,
  101.4,
  101.1,
  103.3,
  107.9,
  116.75,
  126.516],
 'Curricular units 1st sem (enrolled)': [20],
 'Curricular units 1st sem (evaluations)': [37],
 'Curricular units 1st sem (approved)': [23],
 'Curricular units 1st sem (grade)': [2.0,
  6.333333333333334,
  8.2,
  8.714285714285714,
  16.875,
  13.94,
  14.777777777778,
  15.777777777778,
  11.5625,
  12.900375,
  13.711111111111112,
  14.193333333333332,
  15.137

In [96]:
numeric_features.append("GDP")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)',
 'Curricular units 1st sem (approved)',
 'Curricular units 1st sem (grade)',
 'Curricular units 1st sem (without evaluations)',
 'Curricular units 2nd sem (credited)',
 'Curricular units 2nd sem (enrolled)',
 'Curricular units 2nd sem (evaluations)',
 'Curricular units 2nd sem (approved)',
 'Curricular units 2nd sem (grade)',
 'Curricular units 2nd sem (without evaluations)',
 'Unemployment rate',
 'Inflation rate',
 'GDP']

### Verifying the numeric and categorical features

In [97]:
for f in categorical_features : 
    if f in list(train_df.columns) : 
        print("correct : {}".format(f))
    else : 
        print("wrong : {}".format(f))

correct : Marital status
correct : Application mode
correct : Application order
correct : Course
correct : Daytime/evening attendance
correct : Previous qualification
correct : Nacionality
correct : Mother's qualification
correct : Father's qualification
correct : Mother's occupation
correct : Father's occupation
correct : Displaced
correct : Educational special needs
correct : Debtor
correct : Tuition fees up to date
correct : Gender
correct : Scholarship holder
correct : International


In [98]:
len(categorical_features)

18

In [99]:
numeric_features = [f if f[:3] != "Age" else "Age at enrollment" for f in numeric_features]
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)',
 'Curricular units 1st sem (approved)',
 'Curricular units 1st sem (grade)',
 'Curricular units 1st sem (without evaluations)',
 'Curricular units 2nd sem (credited)',
 'Curricular units 2nd sem (enrolled)',
 'Curricular units 2nd sem (evaluations)',
 'Curricular units 2nd sem (approved)',
 'Curricular units 2nd sem (grade)',
 'Curricular units 2nd sem (without evaluations)',
 'Unemployment rate',
 'Inflation rate',
 'GDP']

In [100]:
for f in numeric_features : 
    if f in list(train_df.columns) : 
        print("correct : {}".format(f))
    else : 
        print("wrong : {}".format(f))

correct : Previous qualification (grade)
correct : Admission grade
correct : Age at enrollment
correct : Curricular units 1st sem (credited)
correct : Curricular units 1st sem (enrolled)
correct : Curricular units 1st sem (evaluations)
correct : Curricular units 1st sem (approved)
correct : Curricular units 1st sem (grade)
correct : Curricular units 1st sem (without evaluations)
correct : Curricular units 2nd sem (credited)
correct : Curricular units 2nd sem (enrolled)
correct : Curricular units 2nd sem (evaluations)
correct : Curricular units 2nd sem (approved)
correct : Curricular units 2nd sem (grade)
correct : Curricular units 2nd sem (without evaluations)
correct : Unemployment rate
correct : Inflation rate
correct : GDP


In [101]:
len(numeric_features)

18

In [102]:
set(train_df.columns).difference(categorical_features + numeric_features)

{'Target', 'id'}
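The membership loops and the set difference above can be folded into one check. The following is a minimal sketch, assuming a hypothetical helper named `check_feature_lists` (not part of the original notebook) and a small made-up frame for illustration.

```python
import pandas as pd


def check_feature_lists(df, categorical, numeric, ignore=("id", "Target")):
    """Report problems with the two feature lists (hypothetical helper).

    - missing:    listed names that are not columns of df
    - overlap:    names that appear in both lists
    - unassigned: columns of df that are in neither list (ignoring `ignore`)
    """
    columns = set(df.columns)
    listed = list(categorical) + list(numeric)
    return {
        "missing": sorted(set(listed) - columns),
        "overlap": sorted(set(categorical) & set(numeric)),
        "unassigned": sorted(columns - set(listed) - set(ignore)),
    }


# Tiny illustrative frame; the real notebook would pass train_df instead.
demo_df = pd.DataFrame(
    {"id": [0, 1], "Course": [9238, 9500], "GDP": [2.02, 0.32], "Target": ["Graduate", "Enrolled"]}
)
report = check_feature_lists(demo_df, ["Course"], ["GDP", "Age"])
print(report)
```

An empty list under every key would mean the two lists cover the frame exactly once, which is what the cells above confirm for `train_df`.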

### Adjusting the numeric and categorical features
- `Application order` takes integer values from 0 to 9 (an ordered rank), so it is moved from the categorical list to the numeric list

In [106]:
get_feature_desc("Application order")

'Application order (between 0 - first choice; and 9 last choice)'

In [108]:
train_df["Application order"].unique()

array([1, 2, 3, 6, 4, 5, 0, 9], dtype=int64)

In [110]:
categorical_features.remove("Application order")
categorical_features

['Marital status',
 'Application mode',
 'Course',
 'Daytime/evening attendance',
 'Previous qualification',
 'Nacionality',
 "Mother's qualification",
 "Father's qualification",
 "Mother's occupation",
 "Father's occupation",
 'Displaced',
 'Educational special needs',
 'Debtor',
 'Tuition fees up to date',
 'Gender',
 'Scholarship holder',
 'International']

In [111]:
numeric_features.append("Application order")
numeric_features

['Previous qualification (grade)',
 'Admission grade',
 'Age at enrollment',
 'Curricular units 1st sem (credited)',
 'Curricular units 1st sem (enrolled)',
 'Curricular units 1st sem (evaluations)',
 'Curricular units 1st sem (approved)',
 'Curricular units 1st sem (grade)',
 'Curricular units 1st sem (without evaluations)',
 'Curricular units 2nd sem (credited)',
 'Curricular units 2nd sem (enrolled)',
 'Curricular units 2nd sem (evaluations)',
 'Curricular units 2nd sem (approved)',
 'Curricular units 2nd sem (grade)',
 'Curricular units 2nd sem (without evaluations)',
 'Unemployment rate',
 'Inflation rate',
 'GDP',
 'Application order']

In [112]:
len(categorical_features), len(numeric_features)

(17, 19)

### Setting the categorical feature dtype
- Cast to "object" rather than "category" for the AutoML test

In [146]:
train_df[categorical_features] = train_df[categorical_features].astype("object")
train_df.dtypes

id                                                  int64
Marital status                                     object
Application mode                                   object
Application order                                   int64
Course                                             object
Daytime/evening attendance                         object
Previous qualification                             object
Previous qualification (grade)                    float64
Nacionality                                        object
Mother's qualification                             object
Father's qualification                             object
Mother's occupation                                object
Father's occupation                                object
Admission grade                                   float64
Displaced                                          object
Educational special needs                          object
Debtor                                             object
Tuition fees u

In [147]:
test_df[categorical_features] = test_df[categorical_features].astype("object")
test_df.dtypes

id                                                  int64
Marital status                                     object
Application mode                                   object
Application order                                   int64
Course                                             object
Daytime/evening attendance                         object
Previous qualification                             object
Previous qualification (grade)                    float64
Nacionality                                        object
Mother's qualification                             object
Father's qualification                             object
Mother's occupation                                object
Father's occupation                                object
Admission grade                                   float64
Displaced                                          object
Educational special needs                          object
Debtor                                             object
Tuition fees u
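For context on the "object" versus "category" choice, here is a small pandas-only sketch of how the two dtypes differ. It says nothing about how the AutoML library treats each dtype; the series values are illustrative codes, not a claim about the dataset.

```python
import pandas as pd

# Integer codes standing in for a categorical column.
codes = pd.Series([1, 17, 17, 1, 39], name="Application mode")

as_object = codes.astype("object")      # plain Python objects, no category metadata
as_category = codes.astype("category")  # stores the distinct categories plus compact integer codes

print(as_object.dtype, as_category.dtype)
print(as_category.cat.categories.tolist())  # the distinct levels pandas recorded
print(as_object.memory_usage(deep=True), as_category.memory_usage(deep=True))
```

The "category" dtype keeps the set of levels on the column itself, while "object" leaves it to downstream code to work out which values exist.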

## 4. Removing whitespace from feature names

In [148]:
import re

In [149]:
train_df = train_df.rename(columns=lambda x : re.sub('[^A-Za-z0-9_]+', '', x))
train_df.head()

Unnamed: 0,id,Maritalstatus,Applicationmode,Applicationorder,Course,Daytimeeveningattendance,Previousqualification,Previousqualificationgrade,Nacionality,Mothersqualification,...,Curricularunits2ndsemcredited,Curricularunits2ndsemenrolled,Curricularunits2ndsemevaluations,Curricularunits2ndsemapproved,Curricularunits2ndsemgrade,Curricularunits2ndsemwithoutevaluations,Unemploymentrate,Inflationrate,GDP,Target
0,0,1,1,1,9238,1,1,126.0,1,1,...,0,6,7,6,12.428571,0,11.1,0.6,2.02,Graduate
1,1,1,17,1,9238,1,1,125.0,1,19,...,0,6,9,0,0.0,0,11.1,0.6,2.02,Dropout
2,2,1,17,2,9254,1,1,137.0,1,3,...,0,6,0,0,0.0,0,16.2,0.3,-0.92,Dropout
3,3,1,1,3,9500,1,1,131.0,1,19,...,0,8,11,7,12.82,0,11.1,0.6,2.02,Enrolled
4,4,1,1,2,9500,1,1,132.0,1,19,...,0,7,12,6,12.933333,0,7.6,2.6,0.32,Graduate


In [150]:
test_df = test_df.rename(columns=lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
test_df.head()

Unnamed: 0,id,Maritalstatus,Applicationmode,Applicationorder,Course,Daytimeeveningattendance,Previousqualification,Previousqualificationgrade,Nacionality,Mothersqualification,...,Curricularunits1stsemwithoutevaluations,Curricularunits2ndsemcredited,Curricularunits2ndsemenrolled,Curricularunits2ndsemevaluations,Curricularunits2ndsemapproved,Curricularunits2ndsemgrade,Curricularunits2ndsemwithoutevaluations,Unemploymentrate,Inflationrate,GDP
0,76518,1,1,1,9500,1,1,141.0,1,3,...,0,0,8,0,0,0.0,0,13.9,-0.3,0.79
1,76519,1,1,1,9238,1,1,128.0,1,1,...,0,0,6,6,6,13.5,0,11.1,0.6,2.02
2,76520,1,1,1,9238,1,1,118.0,1,1,...,0,0,6,11,5,11.0,0,15.5,2.8,-4.06
3,76521,1,44,1,9147,1,39,130.0,1,1,...,0,3,8,14,5,11.0,0,8.9,1.4,3.51
4,76522,1,39,1,9670,1,1,110.0,1,1,...,0,0,6,9,4,10.666667,2,7.6,2.6,0.32
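The pattern `[^A-Za-z0-9_]+` removes every character that is not a letter, digit, or underscore, so apostrophes, slashes, and parentheses disappear along with the spaces. Because different names could in principle collapse to the same string, a quick collision check is a reasonable safeguard. A minimal sketch, where `sanitize` is a hypothetical helper name:

```python
import re
from collections import Counter


def sanitize(name):
    """Drop every character except letters, digits, and underscores."""
    return re.sub(r"[^A-Za-z0-9_]+", "", name)


original = [
    "Marital status",
    "Mother's qualification",
    "Daytime/evening attendance",
    "Curricular units 1st sem (grade)",
]
cleaned = [sanitize(name) for name in original]

# Names that occur more than once after cleaning would silently overwrite each other.
collisions = [name for name, count in Counter(cleaned).items() if count > 1]

print(cleaned)
print(collisions)
```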


## 5. DataFrame memory optimization
- Change each feature's dtype to the narrowest type that fits its minimum and maximum values
   - After optimization, each feature gets a more specific dtype and memory usage drops.

In [151]:
import numpy as np


def reduce_mem_usage(df):
    """Downcast each numeric column to the narrowest dtype that covers its value range.

    The frame is modified in place and also returned.
    """
    start_mem = df.memory_usage().sum() / 1024 ** 2
    print("Memory usage of df is {:.2f} MB".format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        # object (categorical) columns are left as they are
        if col_type == object:
            continue

        c_min = df[col].min()
        c_max = df[col].max()

        if str(col_type)[:3] == "int":
            # Inclusive bounds: a column whose max is exactly 127 still fits in int8.
            for int_type in (np.int8, np.int16, np.int32, np.int64):
                if c_min >= np.iinfo(int_type).min and c_max <= np.iinfo(int_type).max:
                    df[col] = df[col].astype(int_type)
                    break
        else:
            # Caution: float16 keeps only about 3 significant decimal digits,
            # so values are rounded (for example, 11.1 is stored as 11.1015625).
            for float_type in (np.float16, np.float32):
                if c_min >= np.finfo(float_type).min and c_max <= np.finfo(float_type).max:
                    df[col] = df[col].astype(float_type)
                    break
            else:
                df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024 ** 2
    print("Memory usage after optimizeation is : {:.2f} MB".format(end_mem))
    print("Decreased by {:.1f}%".format(100 * (start_mem - end_mem) / start_mem))

    return df

In [152]:
train = reduce_mem_usage(train_df)
train.info()

Memory usage of df is 22.18 MB
Memory usage after optimizeation is : 12.70 MB
Decreased by 42.8%
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76518 entries, 0 to 76517
Data columns (total 38 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   id                                       76518 non-null  int32  
 1   Maritalstatus                            76518 non-null  object 
 2   Applicationmode                          76518 non-null  object 
 3   Applicationorder                         76518 non-null  int8   
 4   Course                                   76518 non-null  object 
 5   Daytimeeveningattendance                 76518 non-null  object 
 6   Previousqualification                    76518 non-null  object 
 7   Previousqualificationgrade               76518 non-null  float16
 8   Nacionality                              76518 non-null  object 
 9   Mothersqualificatio

In [153]:
test = reduce_mem_usage(test_df)
test.info()

Memory usage of df is 14.40 MB
Memory usage after optimizeation is : 8.08 MB
Decreased by 43.9%
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51012 entries, 0 to 51011
Data columns (total 37 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   id                                       51012 non-null  int32  
 1   Maritalstatus                            51012 non-null  object 
 2   Applicationmode                          51012 non-null  object 
 3   Applicationorder                         51012 non-null  int8   
 4   Course                                   51012 non-null  object 
 5   Daytimeeveningattendance                 51012 non-null  object 
 6   Previousqualification                    51012 non-null  object 
 7   Previousqualificationgrade               51012 non-null  float16
 8   Nacionality                              51012 non-null  object 
 9   Mothersqualification

In [154]:
train_df.memory_usage().sum() / 1024 ** 2

12.697467803955078

In [25]:
np.iinfo(np.int8)

iinfo(min=-128, max=127, dtype=int8)

In [27]:
np.iinfo(np.int8).min, np.iinfo(np.int8).max

(-128, 127)

In [31]:
np.iinfo(np.int16)

iinfo(min=-32768, max=32767, dtype=int16)

In [32]:
np.iinfo(np.int32)

iinfo(min=-2147483648, max=2147483647, dtype=int32)

In [33]:
np.iinfo(np.int64)

iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
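As a point of comparison with the hand-written range checks, pandas offers `pd.to_numeric(..., downcast=...)`, which picks the smallest integer type automatically and never goes below float32 for floats. This is an alternative sketch, not what the notebook ran; `downcast_numeric` and the demo frame are made up for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def downcast_numeric(df):
    """Return a copy whose numeric columns use smaller dtypes where possible."""
    out = df.copy()
    for col in out.columns:
        if is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif is_float_dtype(out[col]):
            # pandas stops at float32 here, which avoids float16 rounding
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


demo = pd.DataFrame(
    {
        "Applicationorder": np.array([1, 2, 9], dtype=np.int64),
        "id": np.array([0, 40000, 76517], dtype=np.int64),
        "Unemploymentrate": np.array([11.1, 16.2, 7.6], dtype=np.float64),
    }
)
small = downcast_numeric(demo)
print(small.dtypes)
```

Unlike `reduce_mem_usage`, this version works on a copy, so the original frame keeps its dtypes.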

## 6. AutoML Modeling (1)
- Install from the Anaconda Prompt
   - pip install flaml
- Test model performance with the default settings

### Verifying the installation

In [125]:
!conda list "^flaml"

# packages in environment at C:\DS\Anaconda3\envs\dev_env:
#
# Name                    Version                   Build  Channel
flaml                     2.1.2                    pypi_0    pypi
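The same check can be done from Python with the standard library, which is handy when `conda` is not on the path. A minimal sketch; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version if flaml is installed in the active environment, otherwise None.
print(installed_version("flaml"))
```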


### Importing the AutoML package

In [126]:
from flaml import AutoML

In [128]:
automl = AutoML()
automl

In [156]:
y = train.pop("Target")
X = train.drop("id", axis=1)

X.shape, y.shape

((76518, 36), (76518,))

In [157]:
X.head()

Unnamed: 0,Maritalstatus,Applicationmode,Applicationorder,Course,Daytimeeveningattendance,Previousqualification,Previousqualificationgrade,Nacionality,Mothersqualification,Fathersqualification,...,Curricularunits1stsemwithoutevaluations,Curricularunits2ndsemcredited,Curricularunits2ndsemenrolled,Curricularunits2ndsemevaluations,Curricularunits2ndsemapproved,Curricularunits2ndsemgrade,Curricularunits2ndsemwithoutevaluations,Unemploymentrate,Inflationrate,GDP
0,1,1,1,9238,1,1,126.0,1,1,19,...,0,0,6,7,6,12.429688,0,11.101562,0.600098,2.019531
1,1,17,1,9238,1,1,125.0,1,19,19,...,0,0,6,9,0,0.0,0,11.101562,0.600098,2.019531
2,1,17,2,9254,1,1,137.0,1,3,19,...,0,0,6,0,0,0.0,0,16.203125,0.300049,-0.919922
3,1,1,3,9500,1,1,131.0,1,19,3,...,0,0,8,11,7,12.820312,0,11.101562,0.600098,2.019531
4,1,1,2,9500,1,1,132.0,1,19,37,...,0,0,7,12,6,12.929688,0,7.601562,2.599609,0.320068
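The float columns above no longer match the raw CSV exactly (11.1 appears as 11.101562, 0.6 as 0.600098) because the memory optimization step stored them as float16. The sketch below reproduces that rounding on a few of those values and compares it with float32.

```python
import numpy as np

# Values taken from the first rows of the raw train data shown earlier.
raw = np.array([11.1, 0.6, 2.02, 12.428571], dtype=np.float64)

half = raw.astype(np.float16).astype(np.float64)    # what float16 actually stores
single = raw.astype(np.float32).astype(np.float64)  # what float32 would store

half_error = np.abs(half - raw).max()
single_error = np.abs(single - raw).max()

print(half)
print(half_error, single_error)
```

Whether this loss matters for tree-based learners is an empirical question; it is worth knowing about when comparing results against a run on the unrounded data.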


### AutoML fitting

In [158]:
automl.fit(X, y, task="classification", metric="roc_auc_ovo", time_budget=3600*3)

[flaml.automl.logger: 07-09 16:10:43] {1680} INFO - task = classification
[flaml.automl.logger: 07-09 16:10:43] {1691} INFO - Evaluation method: cv
[flaml.automl.logger: 07-09 16:10:43] {1789} INFO - Minimizing error metric: 1-roc_auc_ovo
[flaml.automl.logger: 07-09 16:10:43] {1901} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl.logger: 07-09 16:10:43] {2219} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 07-09 16:10:44] {2345} INFO - Estimated sufficient time budget=10848s. Estimated necessary time budget=250s.
[flaml.automl.logger: 07-09 16:10:44] {2392} INFO -  at 1.7s,	estimator lgbm's best error=0.1123,	best estimator lgbm's best error=0.1123
[flaml.automl.logger: 07-09 16:10:44] {2219} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 07-09 16:10:45] {2392} INFO -  at 2.8s,	estimator lgbm's best error=0.1123,	best estimator lgbm's best error=0.1123
[flaml.automl.logger: 07-09 1

[flaml.automl.logger: 07-09 16:11:44] {2219} INFO - iteration 34, current learner lgbm
[flaml.automl.logger: 07-09 16:11:51] {2392} INFO -  at 68.5s,	estimator lgbm's best error=0.0720,	best estimator lgbm's best error=0.0720
[flaml.automl.logger: 07-09 16:11:51] {2219} INFO - iteration 35, current learner xgboost
[flaml.automl.logger: 07-09 16:11:53] {2392} INFO -  at 70.8s,	estimator xgboost's best error=0.0746,	best estimator lgbm's best error=0.0720
[flaml.automl.logger: 07-09 16:11:53] {2219} INFO - iteration 36, current learner xgboost
[flaml.automl.logger: 07-09 16:11:55] {2392} INFO -  at 73.1s,	estimator xgboost's best error=0.0746,	best estimator lgbm's best error=0.0720
[flaml.automl.logger: 07-09 16:11:55] {2219} INFO - iteration 37, current learner xgboost
[flaml.automl.logger: 07-09 16:11:57] {2392} INFO -  at 74.9s,	estimator xgboost's best error=0.0746,	best estimator lgbm's best error=0.0720
[flaml.automl.logger: 07-09 16:11:57] {2219} INFO - iteration 38, current lear

[flaml.automl.logger: 07-09 16:16:59] {2392} INFO -  at 376.8s,	estimator rf's best error=0.0808,	best estimator xgboost's best error=0.0711
[flaml.automl.logger: 07-09 16:16:59] {2219} INFO - iteration 70, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 16:17:01] {2392} INFO -  at 378.7s,	estimator xgb_limitdepth's best error=0.0791,	best estimator xgboost's best error=0.0711
[flaml.automl.logger: 07-09 16:17:01] {2219} INFO - iteration 71, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 16:17:03] {2392} INFO -  at 380.6s,	estimator xgb_limitdepth's best error=0.0791,	best estimator xgboost's best error=0.0711
[flaml.automl.logger: 07-09 16:17:03] {2219} INFO - iteration 72, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 16:17:05] {2392} INFO -  at 382.6s,	estimator xgb_limitdepth's best error=0.0757,	best estimator xgboost's best error=0.0711
[flaml.automl.logger: 07-09 16:17:05] {2219} INFO - iteration 73, current learner xgb_limitdepth
[flaml.autom

[flaml.automl.logger: 07-09 16:23:25] {2219} INFO - iteration 104, current learner lgbm
[flaml.automl.logger: 07-09 16:23:56] {2392} INFO -  at 793.2s,	estimator lgbm's best error=0.0717,	best estimator xgboost's best error=0.0710
[flaml.automl.logger: 07-09 16:23:56] {2219} INFO - iteration 105, current learner lrl1
[flaml.automl.logger: 07-09 16:24:29] {2392} INFO -  at 826.8s,	estimator lrl1's best error=0.0894,	best estimator xgboost's best error=0.0710
[flaml.automl.logger: 07-09 16:24:29] {2219} INFO - iteration 106, current learner xgboost
[flaml.automl.logger: 07-09 16:24:40] {2392} INFO -  at 837.3s,	estimator xgboost's best error=0.0710,	best estimator xgboost's best error=0.0710
[flaml.automl.logger: 07-09 16:24:40] {2219} INFO - iteration 107, current learner lrl1
[flaml.automl.logger: 07-09 16:25:15] {2392} INFO -  at 872.8s,	estimator lrl1's best error=0.0894,	best estimator xgboost's best error=0.0710
[flaml.automl.logger: 07-09 16:25:15] {2219} INFO - iteration 108, cur

[flaml.automl.logger: 07-09 16:56:57] {2219} INFO - iteration 139, current learner rf
[flaml.automl.logger: 07-09 16:57:00] {2392} INFO -  at 2777.4s,	estimator rf's best error=0.0801,	best estimator xgboost's best error=0.0705
[flaml.automl.logger: 07-09 16:57:00] {2219} INFO - iteration 140, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 16:57:15] {2392} INFO -  at 2792.4s,	estimator xgb_limitdepth's best error=0.0717,	best estimator xgboost's best error=0.0705
[flaml.automl.logger: 07-09 16:57:15] {2219} INFO - iteration 141, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 16:57:19] {2392} INFO -  at 2796.2s,	estimator xgb_limitdepth's best error=0.0717,	best estimator xgboost's best error=0.0705
[flaml.automl.logger: 07-09 16:57:19] {2219} INFO - iteration 142, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 16:57:40] {2392} INFO -  at 2818.1s,	estimator xgb_limitdepth's best error=0.0709,	best estimator xgboost's best error=0.0705
[flaml.automl.lo

[flaml.automl.logger: 07-09 17:26:52] {2219} INFO - iteration 173, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 17:31:34] {2392} INFO -  at 4851.8s,	estimator xgb_limitdepth's best error=0.0705,	best estimator xgboost's best error=0.0705
[flaml.automl.logger: 07-09 17:31:34] {2219} INFO - iteration 174, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 17:31:42] {2392} INFO -  at 4859.4s,	estimator xgb_limitdepth's best error=0.0705,	best estimator xgboost's best error=0.0705
[flaml.automl.logger: 07-09 17:31:42] {2219} INFO - iteration 175, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 17:32:23] {2392} INFO -  at 4900.3s,	estimator xgb_limitdepth's best error=0.0705,	best estimator xgboost's best error=0.0705
[flaml.automl.logger: 07-09 17:32:23] {2219} INFO - iteration 176, current learner xgboost
[flaml.automl.logger: 07-09 17:33:46] {2392} INFO -  at 4983.8s,	estimator xgboost's best error=0.0705,	best estimator xgboost's best error=0.0705
[flaml

[flaml.automl.logger: 07-09 18:13:15] {2392} INFO -  at 7352.6s,	estimator xgb_limitdepth's best error=0.0704,	best estimator xgb_limitdepth's best error=0.0704
[flaml.automl.logger: 07-09 18:13:15] {2219} INFO - iteration 207, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 18:14:04] {2392} INFO -  at 7401.3s,	estimator xgb_limitdepth's best error=0.0704,	best estimator xgb_limitdepth's best error=0.0704
[flaml.automl.logger: 07-09 18:14:04] {2219} INFO - iteration 208, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 18:15:04] {2392} INFO -  at 7462.1s,	estimator xgb_limitdepth's best error=0.0704,	best estimator xgb_limitdepth's best error=0.0704
[flaml.automl.logger: 07-09 18:15:04] {2219} INFO - iteration 209, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 18:17:07] {2392} INFO -  at 7584.4s,	estimator xgb_limitdepth's best error=0.0704,	best estimator xgb_limitdepth's best error=0.0704
[flaml.automl.logger: 07-09 18:17:07] {2219} INFO - iteration 

[flaml.automl.logger: 07-09 18:45:46] {2392} INFO -  at 9303.8s,	estimator extra_tree's best error=0.0797,	best estimator xgb_limitdepth's best error=0.0704
[flaml.automl.logger: 07-09 18:45:46] {2219} INFO - iteration 240, current learner extra_tree
[flaml.automl.logger: 07-09 18:45:49] {2392} INFO -  at 9306.8s,	estimator extra_tree's best error=0.0797,	best estimator xgb_limitdepth's best error=0.0704
[flaml.automl.logger: 07-09 18:45:49] {2219} INFO - iteration 241, current learner extra_tree
[flaml.automl.logger: 07-09 18:45:54] {2392} INFO -  at 9311.9s,	estimator extra_tree's best error=0.0797,	best estimator xgb_limitdepth's best error=0.0704
[flaml.automl.logger: 07-09 18:45:54] {2219} INFO - iteration 242, current learner lgbm
[flaml.automl.logger: 07-09 18:57:52] {2392} INFO -  at 10029.4s,	estimator lgbm's best error=0.0707,	best estimator xgb_limitdepth's best error=0.0704
[flaml.automl.logger: 07-09 18:57:52] {2219} INFO - iteration 243, current learner extra_tree
[flaml.

### Checking the AutoML fitting results

In [170]:
automl.best_estimator

'xgb_limitdepth'

In [171]:
automl.best_config

{'n_estimators': 1293,
 'max_depth': 4,
 'min_child_weight': 48.21731443475587,
 'learning_rate': 0.04127378507307332,
 'subsample': 0.8279283104587344,
 'colsample_bylevel': 0.509297255770788,
 'colsample_bytree': 0.6964971278154858,
 'reg_alpha': 0.0057603964502711685,
 'reg_lambda': 2.603650089755664}

In [167]:
print(f'''
* Best ML : {automl.best_estimator}
* Best Hyperparameter config : 
   {automl.best_config}
* Best roc_auc_ovo on validation data : {1 - automl.best_loss:.4g}
* Training duration of best run : {automl.best_config_train_time:.4g}
''')


* Best ML : xgb_limitdepth
* Best Hyperparameter config : 
   {'n_estimators': 1293, 'max_depth': 4, 'min_child_weight': 48.21731443475587, 'learning_rate': 0.04127378507307332, 'subsample': 0.8279283104587344, 'colsample_bylevel': 0.509297255770788, 'colsample_bytree': 0.6964971278154858, 'reg_alpha': 0.0057603964502711685, 'reg_lambda': 2.603650089755664}
* Best roc_auc_ovo on validation data : 0.9296
* Training duration of best run : 22.06
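The log reports an error of `1 - roc_auc_ovo`, so the final error of 0.0704 corresponds to the score of 0.9296 printed above. To make the metric itself concrete, the sketch below computes a one-vs-one ROC AUC with scikit-learn on synthetic three-class data. The data and the logistic regression model are stand-ins chosen for illustration, not the notebook's data or the tuned model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem standing in for Dropout / Enrolled / Graduate.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_va)  # one probability column per class

# One-vs-one AUC: averages the AUC over every pair of classes.
auc_ovo = roc_auc_score(y_va, proba, multi_class="ovo")
error = 1 - auc_ovo  # the quantity the log calls "error"

# Converting the logged error back to the reported score.
score_from_log = round(1 - 0.0704, 4)

print(auc_ovo, error, score_from_log)
```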



### Predicting the test data with the AutoML model

In [168]:
y_pred = automl.predict(test)
y_pred[:5]

array(['Dropout', 'Graduate', 'Graduate', 'Graduate', 'Enrolled'],
      dtype=object)
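Note that `test` still carries the `id` column that was dropped from `X` before fitting. Whether a fitted pipeline ignores extra columns depends on the library's preprocessing, so one defensive option is to select exactly the training columns, in the training order, before predicting. A pandas-only sketch with made-up frames:

```python
import pandas as pd

# Stand-ins for X (training features) and test (which still has an id column).
train_X = pd.DataFrame({"Course": ["9238"], "GDP": [2.02]})
test_raw = pd.DataFrame({"id": [76518], "GDP": [0.79], "Course": ["9500"]})

# Fail early if the test frame lacks a feature the model was trained on.
missing = [col for col in train_X.columns if col not in test_raw.columns]
if missing:
    raise ValueError(f"test data is missing columns: {missing}")

# Same columns, same order as during training; id is left out.
test_X = test_raw[train_X.columns.tolist()]
print(test_X.columns.tolist())
```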

In [169]:
automl_submission_df = submission_df.copy()
automl_submission_df["Target"] = y_pred
automl_submission_df.head()

Unnamed: 0,id,Target
0,76518,Dropout
1,76519,Graduate
2,76520,Graduate
3,76521,Graduate
4,76522,Enrolled


In [172]:
automl_submission_df.to_csv("./submission/automl_submission.csv", index=False)

In [173]:
import os

In [174]:
os.listdir("./submission")

['automl_submission.csv',
 'binning_10_submission_df.csv',
 'binning_20_submission_df.csv',
 'binning_30_submission_df.csv',
 'binning_cate_submission_df.csv',
 'binning_cate_submission_df_2.csv',
 'binning_cate_submission_df_3.csv',
 'binning_logtrans_histGB_grid_cv_submission_df.csv',
 'binning_logtrans_histGB_grid_cv_submission_df_2.csv',
 'cap_temp_submission_df.csv',
 'histGB_best_estimator_dorp_association_features.csv',
 'histGB_best_estimator_feature_importances_69.csv',
 'histGB_best_estimator_feature_importances_70.csv',
 'histGB_best_estimator_feature_importances_79.csv',
 'histGB_best_estimator_feature_importance_119.csv',
 'histGB_best_estimator_submission_df.csv',
 'histGB_best_estimator_submission_df_2.csv',
 'histgb_grid_cv_best_model_submission_df.csv',
 'linearR_grid_cv_best_model_submission_df.csv',
 'log_trans_submission_df.csv',
 'nive_prior_submission_df.csv',
 'ordinal_interaction_select_submission_df.csv',
 'quant_cap_submission_df.csv',
 'selectkb_200_histGB_gr

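## 7. AutoML Modeling (2)
- Convert the categorical feature values in the train data back to the form used in the original dataset
   - integer code -> categorical label
- Set the AutoML metric to accuracy

### Categorical Feature preprocess
- Replace the categorical feature values in the train data with the labels from the original dataset
- Run the AutoML experiment on this data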

In [183]:
def clean_desc(col) : 
    
    '''
    Parse the variable description provided on the original UCI page.
       - Returns a list of (category_number, category_label) tuples, e.g. [(1, "single"), ...]
    '''
    
    desc = get_feature_desc(col).split(" ")
    cates = []
    
    for i, c in enumerate(desc) :
        cate = ""
        # If the token is a number, treat it as a category code, then
        # scan the tokens after it to build the label for that code.
        if c.isdigit() : 
            cate_num = c
            
            for d in desc[i+1:] :
                # A non-numeric token is a word of the label: append it to the string.
                if not d.isdigit() :
                    if d not in ["–", "-"] :
                        cate += d + " "
                # A numeric token is the code of the next category, so stop here.
                else : 
                    break
                    
            cate = cate.rstrip(" ")
            cates.append((int(cate_num), cate))
    
    return cates
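To see what the parser produces, the sketch below applies the same token-scanning idea to description strings passed in directly. `parse_desc` is a hypothetical standalone variant (it takes the text instead of calling `get_feature_desc`), and the optional `strip_chars` argument is an addition for illustration. The second input is only an excerpt of the nationality description.

```python
def parse_desc(text, strip_chars=""):
    """Split 'code - label code - label ...' text into (code, label) pairs."""
    tokens = text.split(" ")
    pairs = []
    for i, tok in enumerate(tokens):
        if not tok.isdigit():
            continue
        words = []
        for nxt in tokens[i + 1:]:
            if nxt.isdigit():  # next category code: the current label is complete
                break
            if nxt not in ("–", "-"):  # skip the separator dashes
                words.append(nxt)
        label = " ".join(words).strip().rstrip(strip_chars)
        pairs.append((int(tok), label))
    return pairs


attendance = parse_desc("1 – daytime 0 - evening")
nationality = parse_desc("1 - Portuguese; 2 - German; 6 - Spanish;")
nationality_clean = parse_desc("1 - Portuguese; 2 - German; 6 - Spanish;", strip_chars=";")

print(attendance)
print(nationality)
print(nationality_clean)
```

The middle result shows why labels such as `Portuguese;` keep a trailing semicolon after mapping: the separator is part of the token. A label that itself contains a standalone number would also be cut short by this approach.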

### Preprocessing the feature names in the descriptions
- Remove whitespace and special characters

In [203]:
feature_info["name"] = feature_info["name"].apply(lambda x: re.sub('[^A-Za-z0-9_]+', '', x))
feature_info

Unnamed: 0,name,role,type,demographic,description,units,missing_values
0,Maritalstatus,Feature,Integer,Marital Status,1 – single 2 – married 3 – widower 4 – divorce...,,no
1,Applicationmode,Feature,Integer,,1 - 1st phase - general contingent 2 - Ordinan...,,no
2,Applicationorder,Feature,Integer,,Application order (between 0 - first choice; a...,,no
3,Course,Feature,Integer,,33 - Biofuel Production Technologies 171 - Ani...,,no
4,Daytimeeveningattendance,Feature,Integer,,1 – daytime 0 - evening,,no
5,Previousqualification,Feature,Integer,Education Level,1 - Secondary education 2 - Higher education -...,,no
6,Previousqualificationgrade,Feature,Continuous,,Grade of previous qualification (between 0 and...,,no
7,Nacionality,Feature,Integer,Nationality,1 - Portuguese; 2 - German; 6 - Spanish; 11 - ...,,no
8,Mothersqualification,Feature,Integer,Education Level,1 - Secondary Education - 12th Year of Schooli...,,no
9,Fathersqualification,Feature,Integer,Education Level,1 - Secondary Education - 12th Year of Schooli...,,no


### Preprocessing the existing categorical features
- Remove whitespace and special characters

In [212]:
re_cate = pd.DataFrame(categorical_features, columns=["cate_feature"])["cate_feature"].apply(lambda x: re.sub("[^A-Za-z0-9_]+", "", x)).values
re_cate

array(['Maritalstatus', 'Applicationmode', 'Course',
       'Daytimeeveningattendance', 'Previousqualification', 'Nacionality',
       'Mothersqualification', 'Fathersqualification',
       'Mothersoccupation', 'Fathersoccupation', 'Displaced',
       'Educationalspecialneeds', 'Debtor', 'Tuitionfeesuptodate',
       'Gender', 'Scholarshipholder', 'International'], dtype=object)

### Converting the categorical feature values in the train data
- integer code -> categorical label

In [213]:
for col in re_cate : 
    col_desc = clean_desc(col)
    col_mapper = dict(col_desc)
    train[col] = train[col].map(col_mapper)
    
train.head()    

Unnamed: 0,id,Maritalstatus,Applicationmode,Applicationorder,Course,Daytimeeveningattendance,Previousqualification,Previousqualificationgrade,Nacionality,Mothersqualification,...,Curricularunits1stsemwithoutevaluations,Curricularunits2ndsemcredited,Curricularunits2ndsemenrolled,Curricularunits2ndsemevaluations,Curricularunits2ndsemapproved,Curricularunits2ndsemgrade,Curricularunits2ndsemwithoutevaluations,Unemploymentrate,Inflationrate,GDP
0,0,single,1st phase general contingent,1,Social Service,daytime,Secondary education,126.0,Portuguese;,Secondary Education 12th Year of Schooling or Eq.,...,0,0,6,7,6,12.429688,0,11.101562,0.600098,2.019531
1,1,single,2nd phase general contingent,1,Social Service,daytime,Secondary education,125.0,Portuguese;,Basic Education 3rd Cycle (9th/10th/11th Year)...,...,0,0,6,9,0,0.0,0,11.101562,0.600098,2.019531
2,2,single,2nd phase general contingent,2,Tourism,daytime,Secondary education,137.0,Portuguese;,Higher Education Degree,...,0,0,6,0,0,0.0,0,16.203125,0.300049,-0.919922
3,3,single,1st phase general contingent,3,Nursing,daytime,Secondary education,131.0,Portuguese;,Basic Education 3rd Cycle (9th/10th/11th Year)...,...,0,0,8,11,7,12.820312,0,11.101562,0.600098,2.019531
4,4,single,1st phase general contingent,2,Nursing,daytime,Secondary education,132.0,Portuguese;,Basic Education 3rd Cycle (9th/10th/11th Year)...,...,0,0,7,12,6,12.929688,0,7.601562,2.599609,0.320068
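Any code that has no entry in the mapper becomes NaN after `Series.map`, so it is worth counting unmapped values before training. The sketch below uses a made-up series and a two-entry mapper; falling back to the original code as a string is one possible policy, not the notebook's.

```python
import pandas as pd

# Code 99 is deliberately absent from the mapper.
codes = pd.Series([1, 2, 99, 1], name="Maritalstatus")
mapper = {1: "single", 2: "married"}

mapped = codes.map(mapper)

# Which original codes were lost in the mapping?
unmapped_codes = sorted(codes[mapped.isna()].unique().tolist())

# Optional fallback: keep the original code (as text) where no label exists.
mapped_with_fallback = mapped.fillna(codes.astype(str))

print(unmapped_codes)
print(mapped_with_fallback.tolist())
```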


In [214]:
train = train.drop("id", axis=1)
train.head()

Unnamed: 0,Maritalstatus,Applicationmode,Applicationorder,Course,Daytimeeveningattendance,Previousqualification,Previousqualificationgrade,Nacionality,Mothersqualification,Fathersqualification,...,Curricularunits1stsemwithoutevaluations,Curricularunits2ndsemcredited,Curricularunits2ndsemenrolled,Curricularunits2ndsemevaluations,Curricularunits2ndsemapproved,Curricularunits2ndsemgrade,Curricularunits2ndsemwithoutevaluations,Unemploymentrate,Inflationrate,GDP
0,single,1st phase general contingent,1,Social Service,daytime,Secondary education,126.0,Portuguese;,Secondary Education 12th Year of Schooling or Eq.,Basic Education 3rd Cycle (9th/10th/11th Year)...,...,0,0,6,7,6,12.429688,0,11.101562,0.600098,2.019531
1,single,2nd phase general contingent,1,Social Service,daytime,Secondary education,125.0,Portuguese;,Basic Education 3rd Cycle (9th/10th/11th Year)...,Basic Education 3rd Cycle (9th/10th/11th Year)...,...,0,0,6,9,0,0.0,0,11.101562,0.600098,2.019531
2,single,2nd phase general contingent,2,Tourism,daytime,Secondary education,137.0,Portuguese;,Higher Education Degree,Basic Education 3rd Cycle (9th/10th/11th Year)...,...,0,0,6,0,0,0.0,0,16.203125,0.300049,-0.919922
3,single,1st phase general contingent,3,Nursing,daytime,Secondary education,131.0,Portuguese;,Basic Education 3rd Cycle (9th/10th/11th Year)...,Higher Education Degree,...,0,0,8,11,7,12.820312,0,11.101562,0.600098,2.019531
4,single,1st phase general contingent,2,Nursing,daytime,Secondary education,132.0,Portuguese;,Basic Education 3rd Cycle (9th/10th/11th Year)...,Basic education 1st cycle (4th/5th year) or eq...,...,0,0,7,12,6,12.929688,0,7.601562,2.599609,0.320068


In [216]:
train[re_cate].dtypes

Maritalstatus               object
Applicationmode             object
Course                      object
Daytimeeveningattendance    object
Previousqualification       object
Nacionality                 object
Mothersqualification        object
Fathersqualification        object
Mothersoccupation           object
Fathersoccupation           object
Displaced                   object
Educationalspecialneeds     object
Debtor                      object
Tuitionfeesuptodate         object
Gender                      object
Scholarshipholder           object
International               object
dtype: object

### AutoML fitting
- metric="accuracy"

In [218]:
X_2 = train.copy()
y_2 = y.copy()

X_2.shape, y_2.shape

((76518, 36), (76518,))

In [219]:
automl.fit(X_2, y_2, task="classification", metric="accuracy", time_budget=3600*3)

[flaml.automl.logger: 07-09 20:03:19] {1680} INFO - task = classification
[flaml.automl.logger: 07-09 20:03:19] {1691} INFO - Evaluation method: cv
[flaml.automl.logger: 07-09 20:03:19] {1789} INFO - Minimizing error metric: 1-accuracy
[flaml.automl.logger: 07-09 20:03:19] {1901} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl.logger: 07-09 20:03:19] {2219} INFO - iteration 0, current learner lgbm
[flaml.automl.logger: 07-09 20:03:20] {2345} INFO - Estimated sufficient time budget=9896s. Estimated necessary time budget=228s.
[flaml.automl.logger: 07-09 20:03:20] {2392} INFO -  at 1.5s,	estimator lgbm's best error=0.2616,	best estimator lgbm's best error=0.2616
[flaml.automl.logger: 07-09 20:03:20] {2219} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 07-09 20:03:21] {2392} INFO -  at 2.5s,	estimator lgbm's best error=0.2616,	best estimator lgbm's best error=0.2616
[flaml.automl.logger: 07-09 20:03

[flaml.automl.logger: 07-09 20:04:20] {2219} INFO - iteration 34, current learner lgbm
[flaml.automl.logger: 07-09 20:04:28] {2392} INFO -  at 69.7s,	estimator lgbm's best error=0.1685,	best estimator lgbm's best error=0.1685
[flaml.automl.logger: 07-09 20:04:28] {2219} INFO - iteration 35, current learner xgboost
[flaml.automl.logger: 07-09 20:04:30] {2392} INFO -  at 71.4s,	estimator xgboost's best error=0.1768,	best estimator lgbm's best error=0.1685
[flaml.automl.logger: 07-09 20:04:30] {2219} INFO - iteration 36, current learner extra_tree
[flaml.automl.logger: 07-09 20:04:31] {2392} INFO -  at 72.0s,	estimator extra_tree's best error=0.2056,	best estimator lgbm's best error=0.1685
[flaml.automl.logger: 07-09 20:04:31] {2219} INFO - iteration 37, current learner xgboost
[flaml.automl.logger: 07-09 20:04:33] {2392} INFO -  at 74.2s,	estimator xgboost's best error=0.1736,	best estimator lgbm's best error=0.1685
[flaml.automl.logger: 07-09 20:04:33] {2219} INFO - iteration 38, curren

[flaml.automl.logger: 07-09 20:07:43] {2219} INFO - iteration 70, current learner xgboost
[flaml.automl.logger: 07-09 20:07:59] {2392} INFO -  at 280.7s,	estimator xgboost's best error=0.1689,	best estimator lgbm's best error=0.1670
[flaml.automl.logger: 07-09 20:07:59] {2219} INFO - iteration 71, current learner xgboost
[flaml.automl.logger: 07-09 20:08:03] {2392} INFO -  at 284.4s,	estimator xgboost's best error=0.1689,	best estimator lgbm's best error=0.1670
[flaml.automl.logger: 07-09 20:08:03] {2219} INFO - iteration 72, current learner extra_tree
[flaml.automl.logger: 07-09 20:08:04] {2392} INFO -  at 285.8s,	estimator extra_tree's best error=0.1793,	best estimator lgbm's best error=0.1670
[flaml.automl.logger: 07-09 20:08:04] {2219} INFO - iteration 73, current learner xgboost
[flaml.automl.logger: 07-09 20:08:33] {2392} INFO -  at 314.1s,	estimator xgboost's best error=0.1681,	best estimator lgbm's best error=0.1670
[flaml.automl.logger: 07-09 20:08:33] {2219} INFO - iteration 

[flaml.automl.logger: 07-09 20:18:06] {2219} INFO - iteration 105, current learner lrl1
[flaml.automl.logger: 07-09 20:18:43] {2392} INFO -  at 924.1s,	estimator lrl1's best error=0.1857,	best estimator lgbm's best error=0.1667
[flaml.automl.logger: 07-09 20:18:43] {2219} INFO - iteration 106, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 20:18:45] {2392} INFO -  at 926.7s,	estimator xgb_limitdepth's best error=0.1721,	best estimator lgbm's best error=0.1667
[flaml.automl.logger: 07-09 20:18:45] {2219} INFO - iteration 107, current learner lrl1
[flaml.automl.logger: 07-09 20:19:21] {2392} INFO -  at 962.8s,	estimator lrl1's best error=0.1857,	best estimator lgbm's best error=0.1667
[flaml.automl.logger: 07-09 20:19:21] {2219} INFO - iteration 108, current learner lrl1
[flaml.automl.logger: 07-09 20:19:59] {2392} INFO -  at 1000.8s,	estimator lrl1's best error=0.1857,	best estimator lgbm's best error=0.1667
[flaml.automl.logger: 07-09 20:19:59] {2219} INFO - iteration 109, 

[flaml.automl.logger: 07-09 20:33:27] {2392} INFO -  at 1808.8s,	estimator lgbm's best error=0.1667,	best estimator lgbm's best error=0.1667
[flaml.automl.logger: 07-09 20:33:27] {2219} INFO - iteration 140, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 20:36:46] {2392} INFO -  at 2007.4s,	estimator xgb_limitdepth's best error=0.1673,	best estimator lgbm's best error=0.1667
[flaml.automl.logger: 07-09 20:36:46] {2219} INFO - iteration 141, current learner lgbm
[flaml.automl.logger: 07-09 20:38:48] {2392} INFO -  at 2129.3s,	estimator lgbm's best error=0.1667,	best estimator lgbm's best error=0.1667
[flaml.automl.logger: 07-09 20:38:48] {2219} INFO - iteration 142, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 20:39:09] {2392} INFO -  at 2150.5s,	estimator xgb_limitdepth's best error=0.1673,	best estimator lgbm's best error=0.1667
[flaml.automl.logger: 07-09 20:39:09] {2219} INFO - iteration 143, current learner rf
[flaml.automl.logger: 07-09 20:39:15] {2392} I

[flaml.automl.logger: 07-09 21:10:41] {2219} INFO - iteration 175, current learner rf
[flaml.automl.logger: 07-09 21:10:45] {2392} INFO -  at 4046.4s,	estimator rf's best error=0.1747,	best estimator lgbm's best error=0.1661
[flaml.automl.logger: 07-09 21:10:45] {2219} INFO - iteration 176, current learner rf
[flaml.automl.logger: 07-09 21:10:51] {2392} INFO -  at 4052.1s,	estimator rf's best error=0.1747,	best estimator lgbm's best error=0.1661
[flaml.automl.logger: 07-09 21:10:51] {2219} INFO - iteration 177, current learner xgb_limitdepth
[flaml.automl.logger: 07-09 21:11:51] {2392} INFO -  at 4112.0s,	estimator xgb_limitdepth's best error=0.1673,	best estimator lgbm's best error=0.1661
[flaml.automl.logger: 07-09 21:11:51] {2219} INFO - iteration 178, current learner rf
[flaml.automl.logger: 07-09 21:11:57] {2392} INFO -  at 4118.9s,	estimator rf's best error=0.1745,	best estimator lgbm's best error=0.1661
[flaml.automl.logger: 07-09 21:11:57] {2219} INFO - iteration 179, current l

[flaml.automl.logger: 07-09 22:06:56] {2219} INFO - iteration 210, current learner lgbm
[flaml.automl.logger: 07-09 22:14:38] {2392} INFO -  at 7879.0s,	estimator lgbm's best error=0.1657,	best estimator lgbm's best error=0.1657
[flaml.automl.logger: 07-09 22:14:38] {2219} INFO - iteration 211, current learner xgboost
[flaml.automl.logger: 07-09 22:14:51] {2392} INFO -  at 7892.5s,	estimator xgboost's best error=0.1678,	best estimator lgbm's best error=0.1657
[flaml.automl.logger: 07-09 22:14:51] {2219} INFO - iteration 212, current learner extra_tree
[flaml.automl.logger: 07-09 22:15:22] {2392} INFO -  at 7923.5s,	estimator extra_tree's best error=0.1744,	best estimator lgbm's best error=0.1657
[flaml.automl.logger: 07-09 22:15:22] {2219} INFO - iteration 213, current learner rf
[flaml.automl.logger: 07-09 22:15:25] {2392} INFO -  at 7926.7s,	estimator rf's best error=0.1745,	best estimator lgbm's best error=0.1657
[flaml.automl.logger: 07-09 22:15:25] {2219} INFO - iteration 214, cur

[flaml.automl.logger: 07-09 23:04:04] {1932} INFO - Time taken to find the best model: 4342.200758934021
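
The progress lines above share a fixed layout, so the error values can be extracted with a regular expression. This is a minimal sketch that assumes the layout seen in this run. `LINE_RE` and `parse_progress` are hypothetical helper names, not part of FLAML, and `sample` holds only three lines copied from the log above.

```python
import re

# Assumed layout: "at <secs>s, estimator <name>'s best error=<e>, best estimator <name>'s best error=<e>"
LINE_RE = re.compile(
    r"at (?P<secs>[\d.]+)s,\s*estimator (?P<name>\w+)'s best error=(?P<err>[\d.]+),"
    r"\s*best estimator (?P<best>\w+)'s best error=(?P<best_err>[\d.]+)"
)


def parse_progress(log_text):
    """Return (seconds, learner, learner_error, best_learner, best_error) per progress line."""
    rows = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:  # lines that are not progress reports are skipped
            rows.append((float(m["secs"]), m["name"], float(m["err"]),
                         m["best"], float(m["best_err"])))
    return rows


# Three lines copied from the log above (tabs separate the fields).
sample = (
    "[flaml.automl.logger: 07-09 20:03:20] {2392} INFO -  at 1.5s,\testimator lgbm's best error=0.2616,\tbest estimator lgbm's best error=0.2616\n"
    "[flaml.automl.logger: 07-09 20:04:30] {2392} INFO -  at 71.4s,\testimator xgboost's best error=0.1768,\tbest estimator lgbm's best error=0.1685\n"
    "[flaml.automl.logger: 07-09 22:14:38] {2392} INFO -  at 7879.0s,\testimator lgbm's best error=0.1657,\tbest estimator lgbm's best error=0.1657\n"
)

rows = parse_progress(sample)
last_best_error = rows[-1][4]
# The run minimizes 1 - accuracy, so accuracy is recovered as 1 - error.
last_best_accuracy = round(1 - last_best_error, 4)
print(rows[-1][3], last_best_error, last_best_accuracy)
```

Because the log reports `1-accuracy` as the error, the final best error of 0.1657 corresponds to an accuracy of 0.8343.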


### AutoML fitting results

In [220]:
print(f'''
* Best ML : {automl.best_estimator}
* Best Hyperparameter config : 
   {automl.best_config}
* Best accuracy on validation data : {1 - automl.best_loss:.4g}
* Training duration of best run : {automl.best_config_train_time:.4g}
''')


* Best ML : lgbm
* Best Hyperparameter config : 
   {'n_estimators': 1151, 'num_leaves': 65, 'min_child_samples': 16, 'learning_rate': 0.017956970950124464, 'log_max_bin': 5, 'colsample_bytree': 0.20276381302820615, 'reg_alpha': 0.008174668317161207, 'reg_lambda': 0.023014136232196607}
* Best accuracy on validation data : 0.8343
* Training duration of best run : 18.74



In [228]:
automl.best_estimator

'lgbm'

### Predicting the test data with the second AutoML experiment

In [222]:
test = test.drop("id", axis=1)

# Replace each categorical code in test with its original label
for col in re_cate:
    col_desc = clean_desc(col)
    col_mapper = dict(col_desc)
    test[col] = test[col].map(col_mapper)

test.head()

Unnamed: 0,Maritalstatus,Applicationmode,Applicationorder,Course,Daytimeeveningattendance,Previousqualification,Previousqualificationgrade,Nacionality,Mothersqualification,Fathersqualification,...,Curricularunits1stsemwithoutevaluations,Curricularunits2ndsemcredited,Curricularunits2ndsemenrolled,Curricularunits2ndsemevaluations,Curricularunits2ndsemapproved,Curricularunits2ndsemgrade,Curricularunits2ndsemwithoutevaluations,Unemploymentrate,Inflationrate,GDP
0,single,1st phase general contingent,1,Nursing,daytime,Secondary education,141.0,Portuguese;,Higher Education Degree,Secondary Education 12th Year of Schooling or Eq.,...,0,0,8,0,0,0.0,0,13.898438,-0.300049,0.790039
1,single,1st phase general contingent,1,Social Service,daytime,Secondary education,128.0,Portuguese;,Secondary Education 12th Year of Schooling or Eq.,Basic Education 3rd Cycle (9th/10th/11th Year)...,...,0,0,6,6,6,13.5,0,11.101562,0.600098,2.019531
2,single,1st phase general contingent,1,Social Service,daytime,Secondary education,118.0,Portuguese;,Secondary Education 12th Year of Schooling or Eq.,Basic Education 3rd Cycle (9th/10th/11th Year)...,...,0,0,6,11,5,11.0,0,15.5,2.800781,-4.058594
3,single,Technological specialization diploma holders,1,Management,daytime,Technological specialization course,130.0,Portuguese;,Secondary Education 12th Year of Schooling or Eq.,Basic Education 3rd Cycle (9th/10th/11th Year)...,...,0,3,8,14,5,11.0,0,8.898438,1.400391,3.509766
4,single,Over,1,Advertising and Marketing Management,daytime,Secondary education,110.0,Portuguese;,Secondary Education 12th Year of Schooling or Eq.,Basic education 1st cycle (4th/5th year) or eq...,...,0,0,6,9,4,10.664062,2,7.601562,2.599609,0.320068
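
`Series.map` with a dict returns `NaN` for any value that is not a key, so a code that appears in the test data but is missing from the mapper would silently become missing. This is a minimal sketch of a check for that case. `unmapped_report` is a hypothetical helper and the toy series stands in for a real column.

```python
import pandas as pd


def unmapped_report(original: pd.Series, mapped: pd.Series) -> list:
    """Return the original codes that turned into NaN after mapping."""
    lost = mapped.isna() & original.notna()
    return sorted(original[lost].unique().tolist())


# Hypothetical toy data: code 6 has no entry in the mapper.
codes = pd.Series([1, 2, 6, 1], name="Maritalstatus")
mapper = {1: "single", 2: "married"}
labels = codes.map(mapper)

missing_codes = unmapped_report(codes, labels)
print(missing_codes)
```

Running a check like this for every column in `re_cate` before calling `predict` shows whether any category was lost in the conversion.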


In [225]:
y_pred_2 = automl.predict(test)
y_pred_2[:5]

array(['Dropout', 'Graduate', 'Graduate', 'Enrolled', 'Enrolled'],
      dtype=object)

In [226]:
automl_submission_2 = submission_df.copy()
automl_submission_2["Target"] = y_pred_2
automl_submission_2.head()

Unnamed: 0,id,Target
0,76518,Dropout
1,76519,Graduate
2,76520,Graduate
3,76521,Enrolled
4,76522,Enrolled


In [227]:
automl_submission_2.to_csv("./submission/automl_submission_2.csv", index=False)
os.listdir("./submission")

['automl_submission.csv',
 'automl_submission_2.csv',
 'binning_10_submission_df.csv',
 'binning_20_submission_df.csv',
 'binning_30_submission_df.csv',
 'binning_cate_submission_df.csv',
 'binning_cate_submission_df_2.csv',
 'binning_cate_submission_df_3.csv',
 'binning_logtrans_histGB_grid_cv_submission_df.csv',
 'binning_logtrans_histGB_grid_cv_submission_df_2.csv',
 'cap_temp_submission_df.csv',
 'histGB_best_estimator_dorp_association_features.csv',
 'histGB_best_estimator_feature_importances_69.csv',
 'histGB_best_estimator_feature_importances_70.csv',
 'histGB_best_estimator_feature_importances_79.csv',
 'histGB_best_estimator_feature_importance_119.csv',
 'histGB_best_estimator_submission_df.csv',
 'histGB_best_estimator_submission_df_2.csv',
 'histgb_grid_cv_best_model_submission_df.csv',
 'linearR_grid_cv_best_model_submission_df.csv',
 'log_trans_submission_df.csv',
 'nive_prior_submission_df.csv',
 'ordinal_interaction_select_submission_df.csv',
 'quant_cap_submission_df.cs

## Useful Python snippets

### zfill()

In [229]:
[f'SNP_{str(x).zfill(3)}_{str(x).zfill(2)}' for x in range(1, 16)]

['SNP_001_01',
 'SNP_002_02',
 'SNP_003_03',
 'SNP_004_04',
 'SNP_005_05',
 'SNP_006_06',
 'SNP_007_07',
 'SNP_008_08',
 'SNP_009_09',
 'SNP_010_10',
 'SNP_011_11',
 'SNP_012_12',
 'SNP_013_13',
 'SNP_014_14',
 'SNP_015_15']
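
The same zero padding can be written with a format specification inside the f-string, which avoids the `str(x)` conversion. A short sketch follows; `snp_label` is a hypothetical helper name.

```python
def snp_label(x: int) -> str:
    """Build the label with format specs instead of str.zfill."""
    return f"SNP_{x:03d}_{x:02d}"


fmt_labels = [snp_label(x) for x in range(1, 16)]
zfill_labels = [f"SNP_{str(x).zfill(3)}_{str(x).zfill(2)}" for x in range(1, 16)]

# Both approaches produce identical labels.
print(fmt_labels == zfill_labels)

# zfill pads only; it never truncates, and it keeps a leading sign in front.
print("1234".zfill(3), "-5".zfill(3))
```

`zfill` works on any string, while the `03d` format spec requires an integer, so the choice depends on the type of the value at hand.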

### Ways to remove elements from a list

In [236]:
test = ["a", "b", "c", "d"]
test.remove("a")  # removes the first element equal to "a"
test

['b', 'c', 'd']

In [234]:
test = ["a", "b", "c", "d"]
[t for t in test if t not in {"c", "d"}]

['a', 'b']

In [235]:
test = ["a", "b", "c", "d"]
test.pop(1)  # removes the element at index 1 and returns it
test

['a', 'c', 'd']

In [237]:
test = ["a", "b", "c", "d"]
# a set drops duplicates and does not preserve order, so the result order may vary
test = list(set(test).difference({"b", "d"}))
test

['a', 'c']
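
The four approaches differ in how they handle duplicates, ordering, and missing values. The sketch below uses made-up lists to show those differences.

```python
items = ["a", "b", "c", "d", "b"]
to_drop = {"b", "d"}

# Comprehension with a set: removes every match and keeps the original order.
kept = [t for t in items if t not in to_drop]

# remove() deletes only the first match, so a later duplicate survives.
dup = ["a", "b", "a"]
dup.remove("a")

# remove() raises ValueError when the value is absent.
try:
    ["a", "b"].remove("z")
    raised = False
except ValueError:
    raised = True

# Slice assignment filters in place, so other references to the list see the change.
alias = items
items[:] = [t for t in items if t not in to_drop]

print(kept, dup, raised, alias)
```

When the order of the remaining elements matters, the comprehension is the safer choice over the set difference.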