
# WELCOME!

In this project, you will apply exploratory data analysis (EDA) to prepare data for predictive modelling. The main challenges will be handling outliers, applying domain knowledge, and feature engineering.

Also, this project aims to improve your ability to implement algorithms for Multi-Class Classification. Thus, you will have the opportunity to implement many algorithms commonly used for Multi-Class Classification problems.

Before diving into the project, please take a look at the description and tasks.

# Description

The 2012 US Army Anthropometric Survey (ANSUR II) was executed by the Natick Soldier Research, Development and Engineering Center (NSRDEC) from October 2010 to April 2012 and is comprised of personnel representing the total US Army force to include the US Army Active Duty, Reserves, and National Guard. In addition to the anthropometric and demographic data described below, the ANSUR II database also consists of 3D whole body, foot, and head scans of Soldier participants. These 3D data are not publicly available out of respect for the privacy of ANSUR II participants. The data from this survey are used for a wide range of equipment design, sizing, and tariffing applications within the military and have many potential commercial, industrial, and academic applications.

The ANSUR II working databases contain 93 anthropometric measurements which were directly measured, and 15 demographic/administrative variables explained below. The ANSUR II Male working database contains a total sample of 4,082 subjects. The ANSUR II Female working database contains a total sample of 1,986 subjects.


DATA DICT:
https://data.world/datamil/ansur-ii-data-dictionary/workspace/file?filename=ANSUR+II+Databases+Overview.pdf

---

To achieve high prediction success, you must understand the data well and try different approaches that may improve prediction of the dependent variable.

First, try to understand the dataset column by column using pandas. Research the domain (body measurements and race characteristics) online to get to know the dataset quickly.

You will implement ***Logistic Regression, Support Vector Machine, XGBoost, Random Forest*** algorithms. Also, evaluate the success of your models with appropriate performance metrics.

At the end of the project, choose the most successful model, try to improve its scores with ***SMOTE***, and make it ready to deploy. Furthermore, use ***SHAP*** to explain how the model you chose works.

# Tasks

#### 1. Exploratory Data Analysis (EDA)
- Import Libraries, Load Dataset, Exploring Data

    *i. Import Libraries*
    
    *ii. Ingest Data*
    
    *iii. Explore Data*
    
    *iv. Outlier Detection*
    
    *v.  Drop unnecessary features*

#### 2. Data Preprocessing
- Scale (if needed)
- Separate the data frame for evaluation purposes

#### 3. Multi-class Classification
- Import libraries
- Implement SVM Classifier
- Implement Decision Tree Classifier
- Implement Random Forest Classifier
- Implement XGBoost Classifier
- Compare The Models
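
For the outlier detection step in the EDA task list, one common starting point is the interquartile range (IQR) rule. The sketch below is only an illustration under stated assumptions: the helper name `iqr_outlier_mask`, the multiplier `k=1.5`, and the toy values are not part of the project materials.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles (hypothetical helper)."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up measurements with one implausibly large value.
toy = pd.Series([60, 61, 62, 63, 64, 65, 120], name="earlength")
mask = iqr_outlier_mask(toy)
outliers = toy[mask].tolist()
print(outliers)
```

Whether a flagged value is a measurement error or a genuine extreme is a domain question, so inspecting flagged rows before dropping them is usually the safer choice.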



# EDA
- Drop unnecessary columns
- Drop any DODRace class with fewer than 500 observations (we assume the model cannot learn a class from fewer than 500 samples)

## Import Libraries
Besides NumPy and pandas, you need to import the necessary modules for data visualization, data preprocessing, and model building and tuning.

*Note: Check out the course materials.*

In [1]:
#pip install pyforest
#!pip install cufflinks

In [2]:
import pyforest
import pandas as pd
import numpy as np
import plotly
import cufflinks as cf
#Enabling the offline mode for interactive plotting locally
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
init_notebook_mode(connected=True)
cf.go_offline()
#To display the plots
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
from sklearn.metrics import make_scorer
from sklearn.metrics import classification_report,confusion_matrix,plot_confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve, auc
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import OrdinalEncoder

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)

## Ingest Data from links below and make a dataframe
- Soldiers Male : https://query.data.world/s/h3pbhckz5ck4rc7qmt2wlknlnn7esr
- Soldiers Female : https://query.data.world/s/sq27zz4hawg32yfxksqwijxmpwmynq

In [3]:
male=pd.read_csv("ANSUR II MALE Public.csv", encoding='latin-1', quotechar = '"' , delimiter = ",")

In [4]:
male.head(2)

Unnamed: 0,subjectid,abdominalextensiondepthsitting,acromialheight,acromionradialelength,anklecircumference,axillaheight,balloffootcircumference,balloffootlength,biacromialbreadth,bicepscircumferenceflexed,bicristalbreadth,bideltoidbreadth,bimalleolarbreadth,bitragionchinarc,bitragionsubmandibulararc,bizygomaticbreadth,buttockcircumference,buttockdepth,buttockheight,buttockkneelength,buttockpopliteallength,calfcircumference,cervicaleheight,chestbreadth,chestcircumference,chestdepth,chestheight,crotchheight,crotchlengthomphalion,crotchlengthposterioromphalion,earbreadth,earlength,earprotrusion,elbowrestheight,eyeheightsitting,footbreadthhorizontal,footlength,forearmcenterofgriplength,forearmcircumferenceflexed,forearmforearmbreadth,forearmhandlength,functionalleglength,handbreadth,handcircumference,handlength,headbreadth,headcircumference,headlength,heelanklecircumference,heelbreadth,hipbreadth,hipbreadthsitting,iliocristaleheight,interpupillarybreadth,interscyei,interscyeii,kneeheightmidpatella,kneeheightsitting,lateralfemoralepicondyleheight,lateralmalleolusheight,lowerthighcircumference,mentonsellionlength,neckcircumference,neckcircumferencebase,overheadfingertipreachsitting,palmlength,poplitealheight,radialestylionlength,shouldercircumference,shoulderelbowlength,shoulderlength,sittingheight,sleevelengthspinewrist,sleeveoutseam,span,stature,suprasternaleheight,tenthribheight,thighcircumference,thighclearance,thumbtipreach,tibialheight,tragiontopofhead,trochanterionheight,verticaltrunkcircumferenceusa,waistbacklength,waistbreadth,waistcircumference,waistdepth,waistfrontlengthsitting,waistheightomphalion,weightkg,wristcircumference,wristheight,Gender,Date,Installation,Component,Branch,PrimaryMOS,SubjectsBirthLocation,SubjectNumericRace,Ethnicity,DODRace,Age,Heightin,Weightlbs,WritingPreference
0,10027,266,1467,337,222,1347,253,202,401,369,274,493,71,319,291,142,979,240,882,619,509,373,1535,291,1074,259,1292,877,607,351,36,71,19,247,802,101,273,349,299,575,477,1136,90,214,193,150,583,206,326,70,332,366,1071,685,422,441,502,560,500,77,391,118,400,436,1447,113,437,273,1151,368,145,928,883,600,1782,1776,1449,1092,610,164,786,491,140,919,1700,501,329,933,240,440,1054,815,175,853,Male,4-Oct-10,Fort Hood,Regular Army,Combat Arms,19D,North Dakota,1,,1,41,71,180,Right hand
1,10032,233,1395,326,220,1293,245,193,394,338,257,479,67,344,320,135,944,232,870,584,468,357,1471,269,1021,253,1244,851,615,376,33,62,18,232,781,98,263,348,289,523,476,1096,86,203,195,146,568,201,334,72,312,356,1046,620,441,447,490,540,488,73,371,131,380,420,1380,118,417,254,1119,353,141,884,868,564,1745,1702,1387,1076,572,169,822,476,120,918,1627,432,316,870,225,371,1054,726,167,815,Male,4-Oct-10,Fort Hood,Regular Army,Combat Support,68W,New York,1,,1,35,68,160,Left hand


In [5]:
male.columns  # there are 108 columns

Index(['subjectid', 'abdominalextensiondepthsitting', 'acromialheight', 'acromionradialelength', 'anklecircumference', 'axillaheight', 'balloffootcircumference', 'balloffootlength', 'biacromialbreadth', 'bicepscircumferenceflexed',
       ...
       'Branch', 'PrimaryMOS', 'SubjectsBirthLocation', 'SubjectNumericRace', 'Ethnicity', 'DODRace', 'Age', 'Heightin', 'Weightlbs', 'WritingPreference'], dtype='object', length=108)

In [6]:
female=pd.read_csv("ANSUR II FEMALE Public.csv", encoding='latin-1', quotechar = '"' , delimiter = ",")
female.head(2)

Unnamed: 0,SubjectId,abdominalextensiondepthsitting,acromialheight,acromionradialelength,anklecircumference,axillaheight,balloffootcircumference,balloffootlength,biacromialbreadth,bicepscircumferenceflexed,bicristalbreadth,bideltoidbreadth,bimalleolarbreadth,bitragionchinarc,bitragionsubmandibulararc,bizygomaticbreadth,buttockcircumference,buttockdepth,buttockheight,buttockkneelength,buttockpopliteallength,calfcircumference,cervicaleheight,chestbreadth,chestcircumference,chestdepth,chestheight,crotchheight,crotchlengthomphalion,crotchlengthposterioromphalion,earbreadth,earlength,earprotrusion,elbowrestheight,eyeheightsitting,footbreadthhorizontal,footlength,forearmcenterofgriplength,forearmcircumferenceflexed,forearmforearmbreadth,forearmhandlength,functionalleglength,handbreadth,handcircumference,handlength,headbreadth,headcircumference,headlength,heelanklecircumference,heelbreadth,hipbreadth,hipbreadthsitting,iliocristaleheight,interpupillarybreadth,interscyei,interscyeii,kneeheightmidpatella,kneeheightsitting,lateralfemoralepicondyleheight,lateralmalleolusheight,lowerthighcircumference,mentonsellionlength,neckcircumference,neckcircumferencebase,overheadfingertipreachsitting,palmlength,poplitealheight,radialestylionlength,shouldercircumference,shoulderelbowlength,shoulderlength,sittingheight,sleevelengthspinewrist,sleeveoutseam,span,stature,suprasternaleheight,tenthribheight,thighcircumference,thighclearance,thumbtipreach,tibialheight,tragiontopofhead,trochanterionheight,verticaltrunkcircumferenceusa,waistbacklength,waistbreadth,waistcircumference,waistdepth,waistfrontlengthsitting,waistheightomphalion,weightkg,wristcircumference,wristheight,Gender,Date,Installation,Component,Branch,PrimaryMOS,SubjectsBirthLocation,SubjectNumericRace,Ethnicity,DODRace,Age,Heightin,Weightlbs,WritingPreference
0,10037,231,1282,301,204,1180,222,177,373,315,263,466,65,338,301,141,1011,223,836,587,476,360,1336,274,922,245,1095,759,557,310,35,65,16,220,713,91,246,316,265,517,432,1028,75,182,184,141,548,191,314,69,345,388,966,645,363,399,435,496,447,55,404,118,335,368,1268,113,362,235,1062,327,148,803,809,513,1647,1560,1280,1013,622,174,736,430,110,844,1488,406,295,850,217,345,942,657,152,756,Female,5-Oct-10,Fort Hood,Regular Army,Combat Support,92Y,Germany,2,,2,26,61,142,Right hand
1,10038,194,1379,320,207,1292,225,178,372,272,250,430,64,294,270,126,893,186,900,583,483,350,1440,261,839,206,1234,835,549,329,32,60,23,208,726,91,249,341,247,468,463,1117,78,187,189,138,535,180,307,60,315,335,1048,595,340,375,483,532,492,69,334,115,302,345,1389,110,426,259,1014,346,142,835,810,575,1751,1665,1372,1107,524,152,771,475,125,901,1470,422,254,708,168,329,1032,534,155,815,Female,5-Oct-10,Fort Hood,Regular Army,Combat Service Support,25U,California,3,Mexican,3,21,64,120,Right hand


In [7]:
male.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4082 entries, 0 to 4081
Columns: 108 entries, subjectid to WritingPreference
dtypes: int64(99), object(9)
memory usage: 3.4+ MB


In [8]:
female = female.rename(columns = {"SubjectId":"subjectid"})

In [9]:
df = pd.concat([male,female])
df.head(3)

Unnamed: 0,subjectid,abdominalextensiondepthsitting,acromialheight,acromionradialelength,anklecircumference,axillaheight,balloffootcircumference,balloffootlength,biacromialbreadth,bicepscircumferenceflexed,bicristalbreadth,bideltoidbreadth,bimalleolarbreadth,bitragionchinarc,bitragionsubmandibulararc,bizygomaticbreadth,buttockcircumference,buttockdepth,buttockheight,buttockkneelength,buttockpopliteallength,calfcircumference,cervicaleheight,chestbreadth,chestcircumference,chestdepth,chestheight,crotchheight,crotchlengthomphalion,crotchlengthposterioromphalion,earbreadth,earlength,earprotrusion,elbowrestheight,eyeheightsitting,footbreadthhorizontal,footlength,forearmcenterofgriplength,forearmcircumferenceflexed,forearmforearmbreadth,forearmhandlength,functionalleglength,handbreadth,handcircumference,handlength,headbreadth,headcircumference,headlength,heelanklecircumference,heelbreadth,hipbreadth,hipbreadthsitting,iliocristaleheight,interpupillarybreadth,interscyei,interscyeii,kneeheightmidpatella,kneeheightsitting,lateralfemoralepicondyleheight,lateralmalleolusheight,lowerthighcircumference,mentonsellionlength,neckcircumference,neckcircumferencebase,overheadfingertipreachsitting,palmlength,poplitealheight,radialestylionlength,shouldercircumference,shoulderelbowlength,shoulderlength,sittingheight,sleevelengthspinewrist,sleeveoutseam,span,stature,suprasternaleheight,tenthribheight,thighcircumference,thighclearance,thumbtipreach,tibialheight,tragiontopofhead,trochanterionheight,verticaltrunkcircumferenceusa,waistbacklength,waistbreadth,waistcircumference,waistdepth,waistfrontlengthsitting,waistheightomphalion,weightkg,wristcircumference,wristheight,Gender,Date,Installation,Component,Branch,PrimaryMOS,SubjectsBirthLocation,SubjectNumericRace,Ethnicity,DODRace,Age,Heightin,Weightlbs,WritingPreference
0,10027,266,1467,337,222,1347,253,202,401,369,274,493,71,319,291,142,979,240,882,619,509,373,1535,291,1074,259,1292,877,607,351,36,71,19,247,802,101,273,349,299,575,477,1136,90,214,193,150,583,206,326,70,332,366,1071,685,422,441,502,560,500,77,391,118,400,436,1447,113,437,273,1151,368,145,928,883,600,1782,1776,1449,1092,610,164,786,491,140,919,1700,501,329,933,240,440,1054,815,175,853,Male,4-Oct-10,Fort Hood,Regular Army,Combat Arms,19D,North Dakota,1,,1,41,71,180,Right hand
1,10032,233,1395,326,220,1293,245,193,394,338,257,479,67,344,320,135,944,232,870,584,468,357,1471,269,1021,253,1244,851,615,376,33,62,18,232,781,98,263,348,289,523,476,1096,86,203,195,146,568,201,334,72,312,356,1046,620,441,447,490,540,488,73,371,131,380,420,1380,118,417,254,1119,353,141,884,868,564,1745,1702,1387,1076,572,169,822,476,120,918,1627,432,316,870,225,371,1054,726,167,815,Male,4-Oct-10,Fort Hood,Regular Army,Combat Support,68W,New York,1,,1,35,68,160,Left hand
2,10033,287,1430,341,230,1327,256,196,427,408,261,544,75,345,330,135,1054,258,901,623,506,412,1501,288,1120,267,1288,854,636,359,40,61,23,237,810,103,270,355,357,575,491,1115,93,220,203,148,573,202,356,70,349,393,1053,665,462,475,496,556,482,72,409,123,403,434,1447,121,431,268,1276,367,167,917,910,604,1867,1735,1438,1105,685,198,807,477,125,918,1678,472,329,964,255,411,1041,929,180,831,Male,4-Oct-10,Fort Hood,Regular Army,Combat Support,68W,New York,2,,2,42,68,205,Left hand


In [10]:
df.shape

(6068, 108)
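
One thing worth checking after `pd.concat`: by default each frame keeps its own index, so row labels from the male table repeat in the female table. The sketch below uses made-up toy frames (`male_toy`, `female_toy` are illustrative names) to show the effect and how `ignore_index=True` avoids it.

```python
import pandas as pd

# Toy stand-ins for the male and female tables (values are made up).
male_toy = pd.DataFrame({"subjectid": [1, 2, 3], "Gender": ["Male"] * 3})
female_toy = pd.DataFrame({"subjectid": [4, 5], "Gender": ["Female"] * 2})

# Default behaviour keeps each frame's own index, so labels 0 and 1 repeat.
stacked = pd.concat([male_toy, female_toy])
has_repeats = stacked.index.duplicated().any()

# ignore_index=True renumbers the rows 0..n-1.
stacked_clean = pd.concat([male_toy, female_toy], ignore_index=True)
is_unique = stacked_clean.index.is_unique

print(has_repeats, is_unique)
```

Repeated index labels do not affect the shape, but they can surprise later when selecting rows with `.loc` or aligning predictions back to the frame.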

In [11]:
df.isnull().sum()

subjectid                            0
abdominalextensiondepthsitting       0
acromialheight                       0
acromionradialelength                0
anklecircumference                   0
axillaheight                         0
balloffootcircumference              0
balloffootlength                     0
biacromialbreadth                    0
bicepscircumferenceflexed            0
bicristalbreadth                     0
bideltoidbreadth                     0
bimalleolarbreadth                   0
bitragionchinarc                     0
bitragionsubmandibulararc            0
bizygomaticbreadth                   0
buttockcircumference                 0
buttockdepth                         0
buttockheight                        0
buttockkneelength                    0
buttockpopliteallength               0
calfcircumference                    0
cervicaleheight                      0
chestbreadth                         0
chestcircumference                   0
chestdepth               
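
With 108 columns, the full `isnull().sum()` listing is long. A minimal sketch, assuming you only want the columns that actually contain missing values, is to filter the counts. The toy frame below is made up; it mimics the empty `Ethnicity` entries visible in the previews above.

```python
import numpy as np
import pandas as pd

# Made-up frame with one fully populated and one partly empty column.
toy = pd.DataFrame({
    "stature": [1776, 1702, 1735],
    "Ethnicity": [np.nan, np.nan, "Mexican"],
})

missing = toy.isnull().sum()
# Keep only columns with at least one missing value, largest first.
missing_only = missing[missing > 0].sort_values(ascending=False)
print(missing_only)
```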

In [12]:
df.sort_values(by='DODRace',ascending=False).head(2)

Unnamed: 0,subjectid,abdominalextensiondepthsitting,acromialheight,acromionradialelength,anklecircumference,axillaheight,balloffootcircumference,balloffootlength,biacromialbreadth,bicepscircumferenceflexed,bicristalbreadth,bideltoidbreadth,bimalleolarbreadth,bitragionchinarc,bitragionsubmandibulararc,bizygomaticbreadth,buttockcircumference,buttockdepth,buttockheight,buttockkneelength,buttockpopliteallength,calfcircumference,cervicaleheight,chestbreadth,chestcircumference,chestdepth,chestheight,crotchheight,crotchlengthomphalion,crotchlengthposterioromphalion,earbreadth,earlength,earprotrusion,elbowrestheight,eyeheightsitting,footbreadthhorizontal,footlength,forearmcenterofgriplength,forearmcircumferenceflexed,forearmforearmbreadth,forearmhandlength,functionalleglength,handbreadth,handcircumference,handlength,headbreadth,headcircumference,headlength,heelanklecircumference,heelbreadth,hipbreadth,hipbreadthsitting,iliocristaleheight,interpupillarybreadth,interscyei,interscyeii,kneeheightmidpatella,kneeheightsitting,lateralfemoralepicondyleheight,lateralmalleolusheight,lowerthighcircumference,mentonsellionlength,neckcircumference,neckcircumferencebase,overheadfingertipreachsitting,palmlength,poplitealheight,radialestylionlength,shouldercircumference,shoulderelbowlength,shoulderlength,sittingheight,sleevelengthspinewrist,sleeveoutseam,span,stature,suprasternaleheight,tenthribheight,thighcircumference,thighclearance,thumbtipreach,tibialheight,tragiontopofhead,trochanterionheight,verticaltrunkcircumferenceusa,waistbacklength,waistbreadth,waistcircumference,waistdepth,waistfrontlengthsitting,waistheightomphalion,weightkg,wristcircumference,wristheight,Gender,Date,Installation,Component,Branch,PrimaryMOS,SubjectsBirthLocation,SubjectNumericRace,Ethnicity,DODRace,Age,Heightin,Weightlbs,WritingPreference
345,11751,198,1355,336,213,1254,236,185,403,365,251,485,70,337,320,142,992,241,863,585,463,380,1423,265,969,228,1220,810,613,376,35,65,29,201,755,95,254,348,312,534,477,1092,87,207,194,150,547,194,320,64,336,361,1007,650,406,423,485,532,475,72,395,120,368,401,1356,114,405,270,1117,369,148,855,883,599,1794,1651,1358,1053,609,181,795,459,122,890,1518,415,264,747,194,337,1016,697,177,767,Male,2-Nov-10,Fort Hood,Regular Army,Combat Service Support,74D,Guyana,18,Caribbean Islander,8,33,66,154,Right hand
449,12292,209,1379,344,220,1296,241,200,402,336,272,482,68,335,310,138,958,233,877,603,487,382,1485,253,934,225,1275,848,630,384,38,64,25,200,766,98,265,348,290,602,480,1144,82,204,197,150,560,199,336,65,334,355,1055,655,390,435,483,548,475,75,385,132,367,412,1459,114,430,261,1128,372,159,858,885,600,1873,1705,1410,1104,571,164,830,474,127,896,1556,434,300,826,195,360,1062,700,165,780,Male,23-Nov-10,Fort Bliss,Regular Army,Combat Service Support,91B,Arizona,318,Caribbean Islander Mexican,8,27,69,170,Right hand


In [13]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
subjectid,6068.0,20757.198418,13159.390894,10027.0,14841.75,20063.5,27234.5,920103.0
abdominalextensiondepthsitting,6068.0,246.468688,37.400106,155.0,219.0,242.0,271.0,451.0
acromialheight,6068.0,1406.161338,79.091048,1115.0,1350.0,1410.0,1462.0,1683.0
acromionradialelength,6068.0,327.374423,20.720018,249.0,313.0,328.0,341.25,393.0
anklecircumference,6068.0,224.891397,16.051833,156.0,214.0,225.0,235.0,293.0
axillaheight,6068.0,1299.608767,72.022286,1038.0,1249.0,1302.0,1349.0,1553.0
balloffootcircumference,6068.0,244.19265,16.84502,186.0,232.0,245.0,256.0,306.0
balloffootlength,6068.0,194.754614,13.516368,151.0,185.0,195.0,204.0,245.0
biacromialbreadth,6068.0,399.204186,30.236914,283.0,376.0,404.0,421.0,489.0
bicepscircumferenceflexed,6068.0,340.934245,41.519866,216.0,311.0,341.0,370.0,490.0


In [14]:
# Function that sets the text color of strongly
# correlated values in a styled DataFrame
def color_red(val):
    """
    Takes a scalar correlation value and returns a CSS
    `color` property: red if 0.95 < val < 1, blue if
    val == 1 (e.g. the diagonal), black otherwise.
    """
    if 0.95 < val < 1:
        color = 'red'
    elif val == 1:
        color = 'blue'
    else:
        color = 'black'  # Jupyter Notebook users should use black
    return 'color: %s' % color
# Correlate the numeric columns only; object columns cannot be correlated
df.select_dtypes('number').corr().style.applymap(color_red)

Unnamed: 0,subjectid,abdominalextensiondepthsitting,acromialheight,acromionradialelength,anklecircumference,axillaheight,balloffootcircumference,balloffootlength,biacromialbreadth,bicepscircumferenceflexed,bicristalbreadth,bideltoidbreadth,bimalleolarbreadth,bitragionchinarc,bitragionsubmandibulararc,bizygomaticbreadth,buttockcircumference,buttockdepth,buttockheight,buttockkneelength,buttockpopliteallength,calfcircumference,cervicaleheight,chestbreadth,chestcircumference,chestdepth,chestheight,crotchheight,crotchlengthomphalion,crotchlengthposterioromphalion,earbreadth,earlength,earprotrusion,elbowrestheight,eyeheightsitting,footbreadthhorizontal,footlength,forearmcenterofgriplength,forearmcircumferenceflexed,forearmforearmbreadth,forearmhandlength,functionalleglength,handbreadth,handcircumference,handlength,headbreadth,headcircumference,headlength,heelanklecircumference,heelbreadth,hipbreadth,hipbreadthsitting,iliocristaleheight,interpupillarybreadth,interscyei,interscyeii,kneeheightmidpatella,kneeheightsitting,lateralfemoralepicondyleheight,lateralmalleolusheight,lowerthighcircumference,mentonsellionlength,neckcircumference,neckcircumferencebase,overheadfingertipreachsitting,palmlength,poplitealheight,radialestylionlength,shouldercircumference,shoulderelbowlength,shoulderlength,sittingheight,sleevelengthspinewrist,sleeveoutseam,span,stature,suprasternaleheight,tenthribheight,thighcircumference,thighclearance,thumbtipreach,tibialheight,tragiontopofhead,trochanterionheight,verticaltrunkcircumferenceusa,waistbacklength,waistbreadth,waistcircumference,waistdepth,waistfrontlengthsitting,waistheightomphalion,weightkg,wristcircumference,wristheight,SubjectNumericRace,DODRace,Age,Heightin,Weightlbs
subjectid,1.0,-0.074702,-0.056287,-0.037505,-0.070446,-0.055057,-0.07151,0.00143,-0.08215,-0.070572,0.028671,-0.080712,-0.049426,-0.086444,-0.053103,-0.054291,-0.01512,-0.06999,-0.022891,-0.031233,0.042649,-0.049184,-0.065412,-0.019509,-0.069493,-0.032708,-0.04008,-0.022426,-0.044245,-0.188644,0.05083,-0.124125,-0.090568,-0.001614,-0.102278,-0.04549,-0.060872,-0.096064,-0.074443,-0.086128,-0.0478,-0.105716,-0.071225,-0.085648,-0.046868,-0.04734,-0.065551,-0.066234,-0.078628,-0.00507,-0.006978,-0.019887,-0.054759,-0.049501,-0.061439,-0.076019,-0.020008,-0.060229,0.02269,-0.108278,-0.024678,-0.038105,-0.082325,-0.09156,-0.181247,-0.051407,-0.008812,-0.049174,-0.091403,-0.065909,-0.103673,-0.075831,-0.047119,-0.02385,-0.095486,-0.064599,-0.072702,-0.002508,-0.030551,-0.105709,-0.102849,-0.072272,-0.005139,-0.046164,-0.082941,-0.039864,-0.062869,-0.065869,-0.047512,-0.149275,-0.044038,-0.066602,-0.101394,-0.037595,0.010501,0.021578,-0.046753,-0.054552,-0.070158
abdominalextensiondepthsitting,-0.074702,1.0,0.360623,0.321755,0.524747,0.290821,0.459174,0.34205,0.421544,0.69146,0.506497,0.724471,0.371894,0.529,0.621899,0.496898,0.741188,0.841581,0.258496,0.483275,0.333003,0.657287,0.346116,0.631068,0.826382,0.78045,0.304562,0.203336,0.466971,0.209593,0.247567,0.402031,0.132538,0.256314,0.28246,0.440374,0.354184,0.356183,0.629757,0.728564,0.317555,0.423842,0.426221,0.462871,0.311137,0.341047,0.370753,0.342171,0.50941,0.439029,0.597205,0.557642,0.268208,0.256375,0.603114,0.546482,0.288285,0.359487,0.235804,0.353888,0.673955,0.297767,0.658587,0.629364,0.329005,0.322198,0.163311,0.291144,0.639949,0.296521,0.192526,0.276912,0.429903,0.286739,0.327873,0.316876,0.361806,0.304103,0.732133,0.725401,0.396786,0.275044,0.18389,0.243939,0.729533,0.572025,0.859924,0.939899,0.958932,0.579296,0.162457,0.825714,0.550544,0.38922,0.021201,-0.079167,0.380614,0.300027,0.793634
acromialheight,-0.056287,0.360623,1.0,0.872475,0.512417,0.987452,0.693403,0.802922,0.735565,0.529353,0.409043,0.633088,0.705621,0.569411,0.581562,0.478282,0.302339,0.355987,0.870776,0.817915,0.755708,0.413845,0.98423,0.549632,0.538292,0.293602,0.960106,0.901554,0.353748,0.421327,0.365927,0.434054,0.268885,0.282664,0.81312,0.667315,0.830812,0.849715,0.640346,0.578186,0.866286,0.89528,0.711809,0.71906,0.758197,0.380922,0.442553,0.560746,0.800841,0.449487,0.227349,0.122086,0.934169,0.335677,0.531856,0.595748,0.899017,0.933643,0.860708,0.683209,0.337429,0.543181,0.630618,0.655042,0.869301,0.716936,0.895714,0.825665,0.69343,0.892212,0.5432,0.820647,0.874753,0.882861,0.886437,0.980269,0.985013,0.946931,0.294388,0.456801,0.807075,0.893164,0.319107,0.886774,0.739371,0.729273,0.446707,0.413142,0.371948,0.604967,0.933248,0.68461,0.734622,0.922687,-0.002789,-0.235121,0.078582,0.944577,0.702188
acromionradialelength,-0.037505,0.321755,0.872475,1.0,0.424626,0.862074,0.60597,0.734323,0.672158,0.461094,0.351546,0.566667,0.61699,0.51824,0.512804,0.425242,0.265856,0.306551,0.821045,0.783783,0.745158,0.357724,0.869814,0.475403,0.479231,0.278684,0.846182,0.855699,0.259435,0.320495,0.319147,0.365616,0.227197,-0.010985,0.641928,0.58422,0.756442,0.822701,0.563453,0.508698,0.841266,0.824941,0.630305,0.637453,0.725897,0.327257,0.400689,0.499182,0.715297,0.440114,0.200773,0.104683,0.863237,0.335555,0.476203,0.545767,0.836715,0.865876,0.817809,0.563735,0.292856,0.480474,0.551235,0.574643,0.795344,0.674063,0.834408,0.804744,0.626806,0.968584,0.516816,0.648899,0.874801,0.94131,0.896815,0.859657,0.866529,0.85459,0.25604,0.394819,0.805036,0.829854,0.283525,0.837744,0.59013,0.613148,0.384245,0.364133,0.338024,0.481149,0.851464,0.603964,0.646149,0.684979,0.011185,-0.201095,0.076888,0.831055,0.620289
anklecircumference,-0.070446,0.524747,0.512417,0.424626,1.0,0.469964,0.71172,0.569457,0.541868,0.643211,0.453308,0.645015,0.696482,0.524998,0.53344,0.486666,0.573783,0.585553,0.367548,0.509879,0.341578,0.817861,0.516963,0.56351,0.617938,0.492821,0.489737,0.386502,0.47136,0.386249,0.226444,0.369309,0.232208,0.301963,0.513834,0.652975,0.586232,0.462582,0.667928,0.616595,0.463116,0.522348,0.580401,0.614245,0.462628,0.385355,0.442149,0.449981,0.735602,0.53418,0.468214,0.41451,0.426804,0.241983,0.517326,0.529058,0.423444,0.514531,0.392554,0.494418,0.701673,0.408256,0.580308,0.586472,0.509908,0.46011,0.360672,0.406494,0.633587,0.433402,0.353401,0.524149,0.551094,0.416143,0.473478,0.517026,0.526091,0.452458,0.601652,0.654285,0.466278,0.407503,0.290894,0.378502,0.670108,0.513502,0.589723,0.569306,0.517324,0.479788,0.420109,0.74595,0.702178,0.514394,-0.015973,-0.15836,-0.025016,0.49977,0.73035
axillaheight,-0.055057,0.290821,0.987452,0.862074,0.469964,1.0,0.657168,0.790214,0.707868,0.465773,0.369885,0.566833,0.676731,0.527963,0.522729,0.426541,0.252569,0.296608,0.886279,0.815331,0.767906,0.361611,0.977075,0.482148,0.461842,0.231214,0.957121,0.917346,0.323353,0.409547,0.34538,0.392758,0.248002,0.232271,0.792602,0.634458,0.819695,0.849554,0.582199,0.50616,0.868659,0.888692,0.673259,0.67721,0.759898,0.341041,0.422081,0.540533,0.771206,0.424687,0.188045,0.090481,0.947847,0.325871,0.475033,0.551906,0.90587,0.935335,0.876211,0.654529,0.290956,0.518087,0.564017,0.594043,0.862626,0.711093,0.907812,0.829905,0.641337,0.882873,0.555291,0.799543,0.852703,0.878079,0.886305,0.975837,0.98093,0.955993,0.245669,0.405564,0.799415,0.906334,0.299336,0.902156,0.68201,0.682437,0.372178,0.337927,0.301106,0.565922,0.94819,0.621812,0.687252,0.905063,-0.001554,-0.22907,0.040748,0.93931,0.64223
balloffootcircumference,-0.07151,0.459174,0.693403,0.60597,0.71172,0.657168,1.0,0.745857,0.738107,0.66511,0.340423,0.717358,0.79988,0.658857,0.652885,0.599016,0.367364,0.467186,0.548577,0.585001,0.447819,0.622328,0.712699,0.586201,0.630597,0.377038,0.70168,0.594762,0.33093,0.361242,0.380798,0.471828,0.314341,0.246536,0.64838,0.922796,0.778583,0.678214,0.755462,0.668938,0.693205,0.666979,0.822658,0.838557,0.659535,0.492958,0.50898,0.584534,0.86196,0.634186,0.239246,0.148848,0.609406,0.375392,0.603736,0.64529,0.62381,0.697785,0.554896,0.644776,0.4922,0.569535,0.728734,0.728329,0.68449,0.628914,0.616439,0.633037,0.763675,0.639088,0.530731,0.656456,0.748401,0.632239,0.695105,0.709889,0.711016,0.617725,0.405478,0.573904,0.642618,0.590843,0.33894,0.568524,0.678087,0.641218,0.511055,0.493328,0.462661,0.544655,0.621127,0.731982,0.826903,0.628475,-0.010784,-0.108116,0.077855,0.691724,0.739781
balloffootlength,0.00143,0.34205,0.802922,0.734323,0.569457,0.790214,0.745857,1.0,0.743084,0.578376,0.300499,0.637175,0.769809,0.614841,0.593266,0.513421,0.272484,0.350413,0.740504,0.722116,0.655393,0.460475,0.822156,0.504492,0.525881,0.279675,0.817957,0.781305,0.24245,0.294316,0.399421,0.388274,0.25182,0.118358,0.636734,0.730299,0.960238,0.80985,0.684719,0.576168,0.851857,0.771855,0.743507,0.750693,0.79181,0.422691,0.475702,0.579723,0.867073,0.621401,0.162807,0.065882,0.773834,0.401841,0.544379,0.607266,0.787763,0.823582,0.735195,0.575923,0.364965,0.575413,0.653249,0.659296,0.725244,0.755341,0.788225,0.790825,0.713541,0.756173,0.565878,0.65188,0.826618,0.789731,0.822577,0.815345,0.814583,0.783657,0.301474,0.462312,0.757497,0.744534,0.320183,0.74354,0.60316,0.636491,0.391656,0.375003,0.358759,0.458554,0.783744,0.653566,0.74577,0.689499,-0.009037,-0.113672,0.030007,0.797004,0.671561
biacromialbreadth,-0.08215,0.421544,0.735565,0.672158,0.541868,0.707868,0.738107,0.743084,1.0,0.667874,0.286414,0.828207,0.724689,0.671039,0.669599,0.617132,0.266059,0.396636,0.611932,0.59446,0.486295,0.47261,0.782975,0.650182,0.645969,0.297557,0.76768,0.656954,0.261315,0.347465,0.395665,0.464477,0.337517,0.170978,0.699214,0.710599,0.764608,0.741715,0.765882,0.706602,0.750388,0.712667,0.770695,0.791861,0.657764,0.50278,0.477815,0.587502,0.787014,0.526022,0.138343,0.023911,0.668087,0.383087,0.715506,0.807178,0.677742,0.728257,0.582734,0.632104,0.335886,0.585125,0.791379,0.805074,0.770608,0.641709,0.69585,0.716581,0.901051,0.718646,0.777838,0.705771,0.8632,0.715036,0.807945,0.772618,0.764694,0.668883,0.310087,0.514456,0.708707,0.63708,0.34132,0.6258,0.659184,0.688361,0.477065,0.459779,0.430513,0.550404,0.692363,0.707292,0.803476,0.624097,-0.011861,-0.124767,0.068712,0.756733,0.722034
bicepscircumferenceflexed,-0.070572,0.69146,0.529353,0.461094,0.643211,0.465773,0.66511,0.578376,0.667874,1.0,0.390937,0.87502,0.603526,0.678673,0.709127,0.620775,0.60271,0.706307,0.405356,0.557966,0.393166,0.720978,0.541421,0.667327,0.841339,0.656603,0.518932,0.413365,0.408759,0.30796,0.3289,0.41411,0.231235,0.272876,0.476675,0.644539,0.586288,0.564829,0.914446,0.858786,0.545617,0.556341,0.678386,0.715128,0.495642,0.470375,0.460106,0.502121,0.707979,0.577342,0.423193,0.362678,0.434934,0.359924,0.708743,0.68385,0.459105,0.538737,0.383822,0.501735,0.666313,0.47795,0.818053,0.798084,0.517685,0.495292,0.392268,0.500239,0.857529,0.464036,0.398648,0.474455,0.649732,0.472182,0.541193,0.520296,0.54593,0.453494,0.682531,0.777149,0.544536,0.418247,0.2712,0.402665,0.73369,0.610489,0.712688,0.726686,0.697231,0.554691,0.415606,0.875688,0.765347,0.505623,-0.004081,-0.066587,0.217227,0.510416,0.862556
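
A colored 99 x 99 matrix is hard to scan by eye. As an alternative, the strongly correlated pairs can be listed directly. The sketch below is an assumption-laden illustration: `high_corr_pairs` is a hypothetical helper, the 0.95 threshold mirrors the styling cell, and the toy data is synthetic.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.95):
    """Return (col_a, col_b, r) for numeric column pairs with |r| above threshold."""
    corr = frame.select_dtypes("number").corr()
    # Keep only the strict upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    strong = pairs[pairs.abs() > threshold]
    return [(a, b, float(r)) for (a, b), r in strong.items()]

# Synthetic data: Heightin is stature converted to inches, so the two
# columns are perfectly correlated; earlength is independent noise.
rng = np.random.default_rng(0)
height = rng.normal(1700, 70, size=200)
toy = pd.DataFrame({
    "stature": height,
    "Heightin": height / 25.4,
    "earlength": rng.normal(60, 5, size=200),
})
pairs = high_corr_pairs(toy)
print(pairs)
```

Pairs found this way are candidates for dropping one of the two columns, since near-duplicate features add little information.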


In [15]:
df.drop(['Age','subjectid','Weightlbs','Heightin','Ethnicity','SubjectNumericRace','Component','Branch','PrimaryMOS','Installation','Date','WritingPreference'],inplace=True, axis=1)

In [16]:
df.columns[0:99]

Index(['abdominalextensiondepthsitting', 'acromialheight', 'acromionradialelength', 'anklecircumference', 'axillaheight', 'balloffootcircumference', 'balloffootlength', 'biacromialbreadth', 'bicepscircumferenceflexed', 'bicristalbreadth', 'bideltoidbreadth', 'bimalleolarbreadth', 'bitragionchinarc', 'bitragionsubmandibulararc', 'bizygomaticbreadth', 'buttockcircumference', 'buttockdepth', 'buttockheight', 'buttockkneelength', 'buttockpopliteallength', 'calfcircumference', 'cervicaleheight', 'chestbreadth', 'chestcircumference', 'chestdepth', 'chestheight', 'crotchheight', 'crotchlengthomphalion', 'crotchlengthposterioromphalion', 'earbreadth', 'earlength', 'earprotrusion', 'elbowrestheight', 'eyeheightsitting', 'footbreadthhorizontal', 'footlength', 'forearmcenterofgriplength', 'forearmcircumferenceflexed', 'forearmforearmbreadth', 'forearmhandlength', 'functionalleglength', 'handbreadth', 'handcircumference', 'handlength', 'headbreadth', 'headcircumference', 'headlength',
       'heel

In [17]:
df.shape

(6068, 96)

# EDA
- Drop unnecessary columns
- Drop every DODRace class with fewer than 500 observations (we assume the model cannot learn a class reliably from fewer than 500 samples)

In [18]:
df.DODRace.value_counts()

1    3792
2    1298
3     679
4     188
6      59
5      49
8       3
Name: DODRace, dtype: int64

In [19]:
df = df[df.DODRace < 4]
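
The filter above hard-codes the class labels that happen to pass the 500-row rule. A minimal sketch of the same idea driven by the counts themselves is shown below; the toy frame and the `min_count` name are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the DODRace column (hypothetical values, not the survey data)
toy = pd.DataFrame({"DODRace": [1] * 6 + [2] * 4 + [3] * 3 + [4] * 1})

min_count = 3  # plays the role of the 500-row threshold used in the notebook
counts = toy["DODRace"].value_counts()
keep = counts[counts >= min_count].index       # classes that are frequent enough
toy_filtered = toy[toy["DODRace"].isin(keep)]  # rows belonging to those classes

print(sorted(keep))
print(toy_filtered.shape)
```

On the real data, setting `min_count = 500` should keep classes 1, 2 and 3, which matches the `df.DODRace < 4` filter given the value counts printed above.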



# DATA Preprocessing
- In this step we split the data into X (features) and y (target),
- then create train and test sets for training and evaluation,
- and finally scale the data if the features are not on the same scale. Why?
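
One possible answer to the "Why?": regularised logistic regression and SVC are sensitive to feature magnitudes, so a measurement in the thousands can dominate one in the tens. The sketch below uses two made-up columns (not taken from the survey) to show `MinMaxScaler` being fitted on the train rows only and then reused on the test rows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two hypothetical measurements on very different scales
X_train_toy = np.array([[1700.0, 19.0], [1800.0, 25.0], [1600.0, 18.0]])
X_test_toy = np.array([[1750.0, 20.0]])

scaler = MinMaxScaler()
X_train_toy_scaled = scaler.fit_transform(X_train_toy)  # learn min/max on train only
X_test_toy_scaled = scaler.transform(X_test_toy)        # reuse the train min/max

print(X_train_toy_scaled.min(axis=0), X_train_toy_scaled.max(axis=0))
print(X_test_toy_scaled)
```

Fitting the scaler on the train rows only keeps information from the test set out of the preprocessing step.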

In [20]:
df.shape

(5769, 96)

# Modelling
- Fit the model on the train dataset
- Get predictions from the vanilla model on both the train and test sets to check for over/underfitting
- Apply GridSearchCV both for hyperparameter tuning and as a sanity check of the model.
- Use the hyperparameters found by the grid search to make the final prediction, and evaluate the result with the chosen metric.

In [21]:
from sklearn.metrics import confusion_matrix, classification_report

def eval_metric(model, X_train, y_train, X_test, y_test):
    # Predict on both sets so train and test scores can be compared
    y_train_pred = model.predict(X_train)
    y_pred = model.predict(X_test)

    print("Test_Set")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print()
    print("Train_Set")
    print(confusion_matrix(y_train, y_train_pred))
    print(classification_report(y_train, y_train_pred))

## 1. Logistic model

### Vanilla Logistic Model

A vanilla model is a model built with the default parameter values.

In [62]:
#df = df.join((df["Gender"].str.get_dummies(sep = ",").add_prefix("gender_")),drop_first =True)
#df = df.join(df["SubjectsBirthLocation"].str.get_dummies(sep = ",").add_prefix("SubBirthLoc_"),drop_first=True)

In [63]:
df = pd.get_dummies(df, drop_first =True)

In [64]:
df.columns == 'Gender'

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

In [128]:
#!pip install movecolumn
import movecolumn as mc
df=mc.MoveToLast(df,'DODRace')
df.head()

Unnamed: 0,abdominalextensiondepthsitting,acromialheight,acromionradialelength,anklecircumference,axillaheight,balloffootcircumference,balloffootlength,biacromialbreadth,bicepscircumferenceflexed,bicristalbreadth,bideltoidbreadth,bimalleolarbreadth,bitragionchinarc,bitragionsubmandibulararc,bizygomaticbreadth,buttockcircumference,buttockdepth,buttockheight,buttockkneelength,buttockpopliteallength,calfcircumference,cervicaleheight,chestbreadth,chestcircumference,chestdepth,chestheight,crotchheight,crotchlengthomphalion,crotchlengthposterioromphalion,earbreadth,earlength,earprotrusion,elbowrestheight,eyeheightsitting,footbreadthhorizontal,footlength,forearmcenterofgriplength,forearmcircumferenceflexed,forearmforearmbreadth,forearmhandlength,functionalleglength,handbreadth,handcircumference,handlength,headbreadth,headcircumference,headlength,heelanklecircumference,heelbreadth,hipbreadth,hipbreadthsitting,iliocristaleheight,interpupillarybreadth,interscyei,interscyeii,kneeheightmidpatella,kneeheightsitting,lateralfemoralepicondyleheight,lateralmalleolusheight,lowerthighcircumference,mentonsellionlength,neckcircumference,neckcircumferencebase,overheadfingertipreachsitting,palmlength,poplitealheight,radialestylionlength,shouldercircumference,shoulderelbowlength,shoulderlength,sittingheight,sleevelengthspinewrist,sleeveoutseam,span,stature,suprasternaleheight,tenthribheight,thighcircumference,thighclearance,thumbtipreach,tibialheight,tragiontopofhead,trochanterionheight,verticaltrunkcircumferenceusa,waistbacklength,waistbreadth,waistcircumference,waistdepth,waistfrontlengthsitting,waistheightomphalion,weightkg,wristcircumference,wristheight,Gender_Male,SubjectsBirthLocation_Alaska,SubjectsBirthLocation_Antigua and 
Barbuda,SubjectsBirthLocation_Argentina,SubjectsBirthLocation_Arizona,SubjectsBirthLocation_Arkansas,SubjectsBirthLocation_Azerbaijan,SubjectsBirthLocation_Bahamas,SubjectsBirthLocation_Barbados,SubjectsBirthLocation_Belarus,SubjectsBirthLocation_Belgium,SubjectsBirthLocation_Belize,SubjectsBirthLocation_Bermuda,SubjectsBirthLocation_Bolivia,SubjectsBirthLocation_Bosnia and Herzegovina,SubjectsBirthLocation_Brazil,SubjectsBirthLocation_British Virgin Islands,SubjectsBirthLocation_Bulgaria,SubjectsBirthLocation_California,SubjectsBirthLocation_Cameroon,SubjectsBirthLocation_Canada,SubjectsBirthLocation_Cape Verde,SubjectsBirthLocation_Chile,SubjectsBirthLocation_Colombia,SubjectsBirthLocation_Colorado,SubjectsBirthLocation_Connecticut,SubjectsBirthLocation_Costa Rica,SubjectsBirthLocation_Cuba,SubjectsBirthLocation_Delaware,SubjectsBirthLocation_Denmark,SubjectsBirthLocation_District of Columbia,SubjectsBirthLocation_Dominica,SubjectsBirthLocation_Dominican Republic,SubjectsBirthLocation_Ecuador,SubjectsBirthLocation_Egypt,SubjectsBirthLocation_El Salvador,SubjectsBirthLocation_Ethiopia,SubjectsBirthLocation_Florida,SubjectsBirthLocation_France,SubjectsBirthLocation_French Guiana,SubjectsBirthLocation_Georgia,SubjectsBirthLocation_Germany,SubjectsBirthLocation_Ghana,SubjectsBirthLocation_Grenada,SubjectsBirthLocation_Guadalupe,SubjectsBirthLocation_Guam,SubjectsBirthLocation_Guatemala,SubjectsBirthLocation_Guyana,SubjectsBirthLocation_Haiti,SubjectsBirthLocation_Hawaii,SubjectsBirthLocation_Honduras,SubjectsBirthLocation_Iceland,SubjectsBirthLocation_Idaho,SubjectsBirthLocation_Illinois,SubjectsBirthLocation_India,SubjectsBirthLocation_Indiana,SubjectsBirthLocation_Iowa,SubjectsBirthLocation_Iran,SubjectsBirthLocation_Iraq,SubjectsBirthLocation_Israel,SubjectsBirthLocation_Italy,SubjectsBirthLocation_Ivory 
Coast,SubjectsBirthLocation_Jamaica,SubjectsBirthLocation_Japan,SubjectsBirthLocation_Kansas,SubjectsBirthLocation_Kentucky,SubjectsBirthLocation_Kenya,SubjectsBirthLocation_Lebanon,SubjectsBirthLocation_Liberia,SubjectsBirthLocation_Louisiana,SubjectsBirthLocation_Maine,SubjectsBirthLocation_Maryland,SubjectsBirthLocation_Massachusetts,SubjectsBirthLocation_Mexico,SubjectsBirthLocation_Michigan,SubjectsBirthLocation_Minnesota,SubjectsBirthLocation_Mississippi,SubjectsBirthLocation_Missouri,SubjectsBirthLocation_Montana,SubjectsBirthLocation_Morocco,SubjectsBirthLocation_Nebraska,SubjectsBirthLocation_Netherlands,SubjectsBirthLocation_Nevada,SubjectsBirthLocation_New Hampshire,SubjectsBirthLocation_New Jersey,SubjectsBirthLocation_New Mexico,SubjectsBirthLocation_New York,SubjectsBirthLocation_New Zealand,SubjectsBirthLocation_Nicaragua,SubjectsBirthLocation_Nigeria,SubjectsBirthLocation_North Carolina,SubjectsBirthLocation_North Dakota,SubjectsBirthLocation_Ohio,SubjectsBirthLocation_Oklahoma,SubjectsBirthLocation_Oregon,SubjectsBirthLocation_Panama,SubjectsBirthLocation_Paraguay,SubjectsBirthLocation_Pennsylvania,SubjectsBirthLocation_Peru,SubjectsBirthLocation_Philippines,SubjectsBirthLocation_Poland,SubjectsBirthLocation_Portugal,SubjectsBirthLocation_Puerto Rico,SubjectsBirthLocation_Rhode Island,SubjectsBirthLocation_Romania,SubjectsBirthLocation_Russia,SubjectsBirthLocation_Saint Lucia,SubjectsBirthLocation_Senegal,SubjectsBirthLocation_Serbia,SubjectsBirthLocation_Sierra Leone,SubjectsBirthLocation_South Africa,SubjectsBirthLocation_South America,SubjectsBirthLocation_South Carolina,SubjectsBirthLocation_South Dakota,SubjectsBirthLocation_South Korea,SubjectsBirthLocation_Sri Lanka,SubjectsBirthLocation_Sudan,SubjectsBirthLocation_Syria,SubjectsBirthLocation_Tennessee,SubjectsBirthLocation_Texas,SubjectsBirthLocation_Togo,SubjectsBirthLocation_Trinidad and Tobago,SubjectsBirthLocation_Turkey,SubjectsBirthLocation_US Virgin 
Islands,SubjectsBirthLocation_Ukraine,SubjectsBirthLocation_United Kingdom,SubjectsBirthLocation_United States,SubjectsBirthLocation_Utah,SubjectsBirthLocation_Venezuela,SubjectsBirthLocation_Vermont,SubjectsBirthLocation_Virginia,SubjectsBirthLocation_Washington,SubjectsBirthLocation_West Virginia,SubjectsBirthLocation_Wisconsin,SubjectsBirthLocation_Wyoming,SubjectsBirthLocation_Zambia,DODRace
0,266,1467,337,222,1347,253,202,401,369,274,493,71,319,291,142,979,240,882,619,509,373,1535,291,1074,259,1292,877,607,351,36,71,19,247,802,101,273,349,299,575,477,1136,90,214,193,150,583,206,326,70,332,366,1071,685,422,441,502,560,500,77,391,118,400,436,1447,113,437,273,1151,368,145,928,883,600,1782,1776,1449,1092,610,164,786,491,140,919,1700,501,329,933,240,440,1054,815,175,853,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,233,1395,326,220,1293,245,193,394,338,257,479,67,344,320,135,944,232,870,584,468,357,1471,269,1021,253,1244,851,615,376,33,62,18,232,781,98,263,348,289,523,476,1096,86,203,195,146,568,201,334,72,312,356,1046,620,441,447,490,540,488,73,371,131,380,420,1380,118,417,254,1119,353,141,884,868,564,1745,1702,1387,1076,572,169,822,476,120,918,1627,432,316,870,225,371,1054,726,167,815,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,287,1430,341,230,1327,256,196,427,408,261,544,75,345,330,135,1054,258,901,623,506,412,1501,288,1120,267,1288,854,636,359,40,61,23,237,810,103,270,355,357,575,491,1115,93,220,203,148,573,202,356,70,349,393,1053,665,462,475,496,556,482,72,409,123,403,434,1447,121,431,268,1276,367,167,917,910,604,1867,1735,1438,1105,685,198,807,477,125,918,1678,472,329,964,255,411,1041,929,180,831,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
3,234,1347,310,230,1239,262,199,401,359,262,518,73,328,309,143,991,242,821,560,437,395,1423,296,1114,262,1205,769,590,341,39,66,25,272,794,106,267,352,318,593,467,1034,91,217,194,158,576,199,341,68,338,367,986,640,458,461,460,511,452,76,393,106,407,446,1357,118,393,249,1155,330,148,903,848,550,1708,1655,1346,1021,604,180,803,445,127,847,1625,461,315,857,205,399,968,794,176,793,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1
4,250,1585,372,247,1478,267,224,435,356,263,524,80,340,310,138,1029,275,1080,706,567,425,1684,304,1048,232,1452,1014,682,382,32,56,19,188,814,111,305,399,324,605,550,1279,94,222,218,153,566,197,374,69,332,372,1251,675,481,505,612,666,585,85,458,135,398,430,1572,132,523,302,1231,400,180,919,995,641,2035,1914,1596,1292,672,194,962,584,122,1090,1679,467,303,868,214,379,1245,946,188,954,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2


In [129]:
X = df.drop(['DODRace'],axis=1)
y = df['DODRace']

In [130]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,stratify= y,random_state=42)

print("X Train features shape: {}\ny Train features shape: {}\nX Test features shape : {}\nY Test features shape : {}".format
      (X_train.shape, y_train.shape, X_test.shape, y_test.shape))

X Train features shape: (4615, 229)
y Train features shape: (4615,)
X Test features shape : (1154, 229)
Y Test features shape : (1154,)


In [131]:
df.shape

(5769, 230)

# Logistic- Scaling

In [132]:
from sklearn.preprocessing import MinMaxScaler

In [133]:
scaler =MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [134]:
df.head(1)

Unnamed: 0,abdominalextensiondepthsitting,acromialheight,acromionradialelength,anklecircumference,axillaheight,balloffootcircumference,balloffootlength,biacromialbreadth,bicepscircumferenceflexed,bicristalbreadth,bideltoidbreadth,bimalleolarbreadth,bitragionchinarc,bitragionsubmandibulararc,bizygomaticbreadth,buttockcircumference,buttockdepth,buttockheight,buttockkneelength,buttockpopliteallength,calfcircumference,cervicaleheight,chestbreadth,chestcircumference,chestdepth,chestheight,crotchheight,crotchlengthomphalion,crotchlengthposterioromphalion,earbreadth,earlength,earprotrusion,elbowrestheight,eyeheightsitting,footbreadthhorizontal,footlength,forearmcenterofgriplength,forearmcircumferenceflexed,forearmforearmbreadth,forearmhandlength,functionalleglength,handbreadth,handcircumference,handlength,headbreadth,headcircumference,headlength,heelanklecircumference,heelbreadth,hipbreadth,hipbreadthsitting,iliocristaleheight,interpupillarybreadth,interscyei,interscyeii,kneeheightmidpatella,kneeheightsitting,lateralfemoralepicondyleheight,lateralmalleolusheight,lowerthighcircumference,mentonsellionlength,neckcircumference,neckcircumferencebase,overheadfingertipreachsitting,palmlength,poplitealheight,radialestylionlength,shouldercircumference,shoulderelbowlength,shoulderlength,sittingheight,sleevelengthspinewrist,sleeveoutseam,span,stature,suprasternaleheight,tenthribheight,thighcircumference,thighclearance,thumbtipreach,tibialheight,tragiontopofhead,trochanterionheight,verticaltrunkcircumferenceusa,waistbacklength,waistbreadth,waistcircumference,waistdepth,waistfrontlengthsitting,waistheightomphalion,weightkg,wristcircumference,wristheight,Gender_Male,SubjectsBirthLocation_Alaska,SubjectsBirthLocation_Antigua and 
Barbuda,SubjectsBirthLocation_Argentina,SubjectsBirthLocation_Arizona,SubjectsBirthLocation_Arkansas,SubjectsBirthLocation_Azerbaijan,SubjectsBirthLocation_Bahamas,SubjectsBirthLocation_Barbados,SubjectsBirthLocation_Belarus,SubjectsBirthLocation_Belgium,SubjectsBirthLocation_Belize,SubjectsBirthLocation_Bermuda,SubjectsBirthLocation_Bolivia,SubjectsBirthLocation_Bosnia and Herzegovina,SubjectsBirthLocation_Brazil,SubjectsBirthLocation_British Virgin Islands,SubjectsBirthLocation_Bulgaria,SubjectsBirthLocation_California,SubjectsBirthLocation_Cameroon,SubjectsBirthLocation_Canada,SubjectsBirthLocation_Cape Verde,SubjectsBirthLocation_Chile,SubjectsBirthLocation_Colombia,SubjectsBirthLocation_Colorado,SubjectsBirthLocation_Connecticut,SubjectsBirthLocation_Costa Rica,SubjectsBirthLocation_Cuba,SubjectsBirthLocation_Delaware,SubjectsBirthLocation_Denmark,SubjectsBirthLocation_District of Columbia,SubjectsBirthLocation_Dominica,SubjectsBirthLocation_Dominican Republic,SubjectsBirthLocation_Ecuador,SubjectsBirthLocation_Egypt,SubjectsBirthLocation_El Salvador,SubjectsBirthLocation_Ethiopia,SubjectsBirthLocation_Florida,SubjectsBirthLocation_France,SubjectsBirthLocation_French Guiana,SubjectsBirthLocation_Georgia,SubjectsBirthLocation_Germany,SubjectsBirthLocation_Ghana,SubjectsBirthLocation_Grenada,SubjectsBirthLocation_Guadalupe,SubjectsBirthLocation_Guam,SubjectsBirthLocation_Guatemala,SubjectsBirthLocation_Guyana,SubjectsBirthLocation_Haiti,SubjectsBirthLocation_Hawaii,SubjectsBirthLocation_Honduras,SubjectsBirthLocation_Iceland,SubjectsBirthLocation_Idaho,SubjectsBirthLocation_Illinois,SubjectsBirthLocation_India,SubjectsBirthLocation_Indiana,SubjectsBirthLocation_Iowa,SubjectsBirthLocation_Iran,SubjectsBirthLocation_Iraq,SubjectsBirthLocation_Israel,SubjectsBirthLocation_Italy,SubjectsBirthLocation_Ivory 
Coast,SubjectsBirthLocation_Jamaica,SubjectsBirthLocation_Japan,SubjectsBirthLocation_Kansas,SubjectsBirthLocation_Kentucky,SubjectsBirthLocation_Kenya,SubjectsBirthLocation_Lebanon,SubjectsBirthLocation_Liberia,SubjectsBirthLocation_Louisiana,SubjectsBirthLocation_Maine,SubjectsBirthLocation_Maryland,SubjectsBirthLocation_Massachusetts,SubjectsBirthLocation_Mexico,SubjectsBirthLocation_Michigan,SubjectsBirthLocation_Minnesota,SubjectsBirthLocation_Mississippi,SubjectsBirthLocation_Missouri,SubjectsBirthLocation_Montana,SubjectsBirthLocation_Morocco,SubjectsBirthLocation_Nebraska,SubjectsBirthLocation_Netherlands,SubjectsBirthLocation_Nevada,SubjectsBirthLocation_New Hampshire,SubjectsBirthLocation_New Jersey,SubjectsBirthLocation_New Mexico,SubjectsBirthLocation_New York,SubjectsBirthLocation_New Zealand,SubjectsBirthLocation_Nicaragua,SubjectsBirthLocation_Nigeria,SubjectsBirthLocation_North Carolina,SubjectsBirthLocation_North Dakota,SubjectsBirthLocation_Ohio,SubjectsBirthLocation_Oklahoma,SubjectsBirthLocation_Oregon,SubjectsBirthLocation_Panama,SubjectsBirthLocation_Paraguay,SubjectsBirthLocation_Pennsylvania,SubjectsBirthLocation_Peru,SubjectsBirthLocation_Philippines,SubjectsBirthLocation_Poland,SubjectsBirthLocation_Portugal,SubjectsBirthLocation_Puerto Rico,SubjectsBirthLocation_Rhode Island,SubjectsBirthLocation_Romania,SubjectsBirthLocation_Russia,SubjectsBirthLocation_Saint Lucia,SubjectsBirthLocation_Senegal,SubjectsBirthLocation_Serbia,SubjectsBirthLocation_Sierra Leone,SubjectsBirthLocation_South Africa,SubjectsBirthLocation_South America,SubjectsBirthLocation_South Carolina,SubjectsBirthLocation_South Dakota,SubjectsBirthLocation_South Korea,SubjectsBirthLocation_Sri Lanka,SubjectsBirthLocation_Sudan,SubjectsBirthLocation_Syria,SubjectsBirthLocation_Tennessee,SubjectsBirthLocation_Texas,SubjectsBirthLocation_Togo,SubjectsBirthLocation_Trinidad and Tobago,SubjectsBirthLocation_Turkey,SubjectsBirthLocation_US Virgin 
Islands,SubjectsBirthLocation_Ukraine,SubjectsBirthLocation_United Kingdom,SubjectsBirthLocation_United States,SubjectsBirthLocation_Utah,SubjectsBirthLocation_Venezuela,SubjectsBirthLocation_Vermont,SubjectsBirthLocation_Virginia,SubjectsBirthLocation_Washington,SubjectsBirthLocation_West Virginia,SubjectsBirthLocation_Wisconsin,SubjectsBirthLocation_Wyoming,SubjectsBirthLocation_Zambia,DODRace
0,266,1467,337,222,1347,253,202,401,369,274,493,71,319,291,142,979,240,882,619,509,373,1535,291,1074,259,1292,877,607,351,36,71,19,247,802,101,273,349,299,575,477,1136,90,214,193,150,583,206,326,70,332,366,1071,685,422,441,502,560,500,77,391,118,400,436,1447,113,437,273,1151,368,145,928,883,600,1782,1776,1449,1092,610,164,786,491,140,919,1700,501,329,933,240,440,1054,815,175,853,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [135]:
from sklearn.linear_model import LogisticRegression
log_model = LogisticRegression(class_weight='balanced',max_iter=10000, random_state=42)  

# class_weight changed our scores:
# without class_weight, precision was in the 0.70s.

In [136]:
log_model.fit(X_train_scaled,y_train)

LogisticRegression(class_weight='balanced', max_iter=10000, random_state=42)

In [137]:
y_pred = log_model.predict(X_test_scaled)

In [138]:
eval_metric(log_model, X_train_scaled, y_train, X_test_scaled, y_test)

Test_Set
[[636  18 104]
 [  7 243  10]
 [ 27  10  99]]
              precision    recall  f1-score   support

           1       0.95      0.84      0.89       758
           2       0.90      0.93      0.92       260
           3       0.46      0.73      0.57       136

    accuracy                           0.85      1154
   macro avg       0.77      0.83      0.79      1154
weighted avg       0.88      0.85      0.86      1154


Train_Set
[[2634   67  333]
 [  31  967   40]
 [  57   20  466]]
              precision    recall  f1-score   support

           1       0.97      0.87      0.92      3034
           2       0.92      0.93      0.92      1038
           3       0.56      0.86      0.67       543

    accuracy                           0.88      4615
   macro avg       0.81      0.89      0.84      4615
weighted avg       0.91      0.88      0.89      4615



In [150]:
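from sklearn.metrics import make_scorer, f1_score, precision_score, recall_score

# DODRace labels are integers, so the label passed to each scorer is 3, not "3"
f1_3 = make_scorer(f1_score, average = None, labels =[3])
precision_3 = make_scorer(precision_score, average = None, labels =[3])
recall_3 = make_scorer(recall_score, average = None, labels =[3])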

In [151]:
scores = cross_validate(log_model, X_train_scaled, y_train, scoring = ['accuracy', 'precision','recall',
                                                                   'f1', 'roc_auc'],cv = 10)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores_mean = df_scores.mean()[2:]
df_scores_mean

# We run cross-validation here; these are the train-set results, and we will compare them with the test-set results above.

test_accuracy    NaN
test_precision   NaN
test_recall      NaN
test_f1          NaN
test_roc_auc     NaN
dtype: float64
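
The NaN values above are what one would expect when the binary scorer names `'precision'`, `'recall'`, `'f1'` and `'roc_auc'` are applied to a three-class target: each fold's scoring fails and is recorded as NaN. A minimal sketch with multiclass-aware scorer names follows; the synthetic data stands in for `X_train_scaled` and `y_train` and is an assumption for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic three-class data standing in for X_train_scaled / y_train (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=42)

model = LogisticRegression(class_weight="balanced", max_iter=1000, random_state=42)

# Averaged scorer names are defined for multiclass targets
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro", "roc_auc_ovr"]
scores = cross_validate(model, X_demo, y_demo, scoring=scoring, cv=5)

df_scores_mean = pd.DataFrame(scores).mean()[2:]  # skip fit_time and score_time
print(df_scores_mean)
```

To track a single class instead of a macro average, scorers built with `make_scorer(..., labels=[3])` can be passed in a dictionary as the `scoring` argument.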

In [None]:
# Cross-validated values

# Try the hyperparameters here with the cross-validation code

# Logistic- with Pipeline

In [139]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [143]:
operations = [("scaler", MinMaxScaler()), ("log_pipeline", LogisticRegression(class_weight='balanced',max_iter=10000, random_state=42))]

In [144]:
Pipeline(steps=operations)

Pipeline(steps=[('scaler', MinMaxScaler()),
                ('log_pipeline',
                 LogisticRegression(class_weight='balanced', max_iter=10000,
                                    random_state=42))])

In [145]:
pipe_model = Pipeline(steps=operations)

In [146]:
pipe_model.fit(X_train, y_train)

Pipeline(steps=[('scaler', MinMaxScaler()),
                ('log_pipeline',
                 LogisticRegression(class_weight='balanced', max_iter=10000,
                                    random_state=42))])

In [147]:
y_pred = pipe_model.predict(X_test)

In [148]:
eval_metric(pipe_model, X_train, y_train, X_test, y_test)

Test_Set
[[636  18 104]
 [  7 243  10]
 [ 27  10  99]]
              precision    recall  f1-score   support

           1       0.95      0.84      0.89       758
           2       0.90      0.93      0.92       260
           3       0.46      0.73      0.57       136

    accuracy                           0.85      1154
   macro avg       0.77      0.83      0.79      1154
weighted avg       0.88      0.85      0.86      1154


Train_Set
[[2634   67  333]
 [  31  967   40]
 [  57   20  466]]
              precision    recall  f1-score   support

           1       0.97      0.87      0.92      3034
           2       0.92      0.93      0.92      1038
           3       0.56      0.86      0.67       543

    accuracy                           0.88      4615
   macro avg       0.81      0.89      0.84      4615
weighted avg       0.91      0.88      0.89      4615



The scores below are from the normal (default) model.

![](2022-10-15-18-42-36.png)

### Logistic Model GridsearchCV

In [71]:
from sklearn.model_selection import GridSearchCV
model = LogisticRegression()

penalty = ["l1", "l2"]
C = np.logspace(-1, 5, 20)
class_weight= ["balanced", None]  
solver = ["liblinear", "saga"]

param_grid = {"penalty" : penalty,
              "C" : C,
              "class_weight":class_weight,
              "solver":solver}

grid_model = GridSearchCV(estimator=model,
                          param_grid=param_grid,
                          cv=10,
                          scoring = "recall",
                          n_jobs = -1) 
grid_model.fit(X_train_scaled,y_train)

Traceback (most recent call last):
  File "/Users/veyselaytekin/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/Users/veyselaytekin/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 216, in __call__
    return self._score(
  File "/Users/veyselaytekin/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 264, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "/Users/veyselaytekin/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py", line 1901, in recall_score
    _, r, _, _ = precision_recall_fscore_support(
  File "/Users/veyselaytekin/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py", line 1544, in precision_recall_fscore_support
    labels = _check_set_wise_labels(y_true, y_pred, average, labels, pos_label)
  File "/Users/veyselaytekin

ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.
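
The traceback points at two kinds of trouble: `scoring = "recall"` is a binary scorer, and an `"l1"` penalty reached a solver that does not support it. Below is a minimal sketch of a search that avoids both, assuming the goal is recall on class 3. The synthetic data, the reduced grid and the name `grid_demo` are illustrative assumptions, not the notebook's actual search.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data relabelled to 1, 2, 3 to mimic DODRace (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=5,
                                     n_classes=3, random_state=42)
y_demo = y_demo + 1

# Recall of class 3 only; average="macro" over a single label returns a scalar
recall_3 = make_scorer(recall_score, average="macro", labels=[3])

# "saga" accepts both "l1" and "l2" and handles multiclass targets
param_grid = {"penalty": ["l1", "l2"],
              "C": np.logspace(-1, 1, 3),
              "class_weight": ["balanced", None],
              "solver": ["saga"]}

grid_demo = GridSearchCV(estimator=LogisticRegression(max_iter=5000, random_state=42),
                         param_grid=param_grid,
                         cv=3,
                         scoring=recall_3)
grid_demo.fit(X_demo, y_demo)

print(grid_demo.best_params_)
print(grid_demo.best_score_)
```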

In [None]:
grid_model.best_params_

In [None]:
eval_metric(grid_model, X_train_scaled, y_train, X_test_scaled, y_test)

# ROC  Curve

### Default Model

In [None]:
from sklearn.metrics import plot_roc_curve, plot_precision_recall_curve, roc_auc_score, auc, roc_curve, average_precision_score, precision_recall_curve

plot_roc_curve(log_model, X_train_scaled, y_train);   
plt.show()

#### Grid model

In [None]:
plot_roc_curve(grid_model, X_test_scaled, y_test);
plt.show()

In [None]:
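y_pred_proba = log_model.predict_proba(X_train_scaled)
# DODRace has three classes, so score the full probability matrix one-vs-rest
roc_auc_score(y_train, y_pred_proba, multi_class="ovr")


# roc_curve is binary only: here class 2 (column 1 of predict_proba) against the rest
fp_rate, tp_rate, thresholds = roc_curve(y_train == 2, y_pred_proba[:,1]) # false positive rate, true positive rate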

# YellowBrick

In [None]:
from yellowbrick.classifier import ROCAUC
model = grid_model
visualizer = ROCAUC(model)
visualizer.fit(X_train_scaled, y_train)        # Fit the training data to the visualizer
visualizer.score(X_test_scaled, y_test)        # Evaluate the model on the test data
visualizer.show();

In [None]:
from yellowbrick.classifier import PrecisionRecallCurve
model = grid_model
viz = PrecisionRecallCurve(model, per_class=True, cmap="Set1")
viz.fit(X_train_scaled, y_train)
viz.score(X_test_scaled, y_test)
viz.show();

## Logistic Regression solver "liblinear" for small datasets

__liblinear__ gives good results on small datasets. Note that liblinear works with both __penalty="l1"__ and __penalty="l2"__. It is the default __lbfgs__ solver that supports only "l2" (or no penalty) and raises an error when given "l1".

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
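
A small check of the solver and penalty combinations, using a binary synthetic dataset as an assumption so the example stays independent of how a given scikit-learn version treats liblinear with multiclass targets:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small binary toy problem (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# liblinear accepts both penalties
fitted = {}
for penalty in ["l1", "l2"]:
    clf = LogisticRegression(solver="liblinear", penalty=penalty).fit(X_demo, y_demo)
    fitted[penalty] = clf.score(X_demo, y_demo)
print(fitted)

# lbfgs rejects the l1 penalty
try:
    LogisticRegression(solver="lbfgs", penalty="l1").fit(X_demo, y_demo)
    lbfgs_l1_failed = False
except ValueError as err:
    lbfgs_l1_failed = True
    print(err)
```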

## 2. SVC

### Vanilla SVC model 

Vanilla models should use the default values, but because our dataset is imbalanced, class_weight should be considered.
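
A minimal sketch of what such a vanilla SVC could look like, following the same scaler-plus-model pipeline pattern used for the logistic model. The imbalanced synthetic data and the name `svc_pipe` are assumptions for illustration; on the notebook's data the pipeline would be fitted on `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Imbalanced synthetic three-class data (assumption, not the survey data)
X_demo, y_demo = make_classification(n_samples=600, n_features=10, n_informative=5,
                                     n_classes=3, weights=[0.65, 0.25, 0.10],
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=42)

# Defaults everywhere except class_weight, which compensates for the imbalance
svc_pipe = Pipeline([("scaler", MinMaxScaler()),
                     ("svc", SVC(class_weight="balanced", random_state=42))])
svc_pipe.fit(X_tr, y_tr)

y_te_pred = svc_pipe.predict(X_te)
print(classification_report(y_te, y_te_pred))
```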

###  SVC Model GridsearchCV

## 3. RF

### OrdinalEncoder

In [22]:
X = df.drop(['DODRace'],axis=1)
y = df['DODRace']

In [23]:
X.head(3)

Unnamed: 0,abdominalextensiondepthsitting,acromialheight,acromionradialelength,anklecircumference,axillaheight,balloffootcircumference,balloffootlength,biacromialbreadth,bicepscircumferenceflexed,bicristalbreadth,bideltoidbreadth,bimalleolarbreadth,bitragionchinarc,bitragionsubmandibulararc,bizygomaticbreadth,buttockcircumference,buttockdepth,buttockheight,buttockkneelength,buttockpopliteallength,calfcircumference,cervicaleheight,chestbreadth,chestcircumference,chestdepth,chestheight,crotchheight,crotchlengthomphalion,crotchlengthposterioromphalion,earbreadth,earlength,earprotrusion,elbowrestheight,eyeheightsitting,footbreadthhorizontal,footlength,forearmcenterofgriplength,forearmcircumferenceflexed,forearmforearmbreadth,forearmhandlength,functionalleglength,handbreadth,handcircumference,handlength,headbreadth,headcircumference,headlength,heelanklecircumference,heelbreadth,hipbreadth,hipbreadthsitting,iliocristaleheight,interpupillarybreadth,interscyei,interscyeii,kneeheightmidpatella,kneeheightsitting,lateralfemoralepicondyleheight,lateralmalleolusheight,lowerthighcircumference,mentonsellionlength,neckcircumference,neckcircumferencebase,overheadfingertipreachsitting,palmlength,poplitealheight,radialestylionlength,shouldercircumference,shoulderelbowlength,shoulderlength,sittingheight,sleevelengthspinewrist,sleeveoutseam,span,stature,suprasternaleheight,tenthribheight,thighcircumference,thighclearance,thumbtipreach,tibialheight,tragiontopofhead,trochanterionheight,verticaltrunkcircumferenceusa,waistbacklength,waistbreadth,waistcircumference,waistdepth,waistfrontlengthsitting,waistheightomphalion,weightkg,wristcircumference,wristheight,Gender,SubjectsBirthLocation
0,266,1467,337,222,1347,253,202,401,369,274,493,71,319,291,142,979,240,882,619,509,373,1535,291,1074,259,1292,877,607,351,36,71,19,247,802,101,273,349,299,575,477,1136,90,214,193,150,583,206,326,70,332,366,1071,685,422,441,502,560,500,77,391,118,400,436,1447,113,437,273,1151,368,145,928,883,600,1782,1776,1449,1092,610,164,786,491,140,919,1700,501,329,933,240,440,1054,815,175,853,Male,North Dakota
1,233,1395,326,220,1293,245,193,394,338,257,479,67,344,320,135,944,232,870,584,468,357,1471,269,1021,253,1244,851,615,376,33,62,18,232,781,98,263,348,289,523,476,1096,86,203,195,146,568,201,334,72,312,356,1046,620,441,447,490,540,488,73,371,131,380,420,1380,118,417,254,1119,353,141,884,868,564,1745,1702,1387,1076,572,169,822,476,120,918,1627,432,316,870,225,371,1054,726,167,815,Male,New York
2,287,1430,341,230,1327,256,196,427,408,261,544,75,345,330,135,1054,258,901,623,506,412,1501,288,1120,267,1288,854,636,359,40,61,23,237,810,103,270,355,357,575,491,1115,93,220,203,148,573,202,356,70,349,393,1053,665,462,475,496,556,482,72,409,123,403,434,1447,121,431,268,1276,367,167,917,910,604,1867,1735,1438,1105,685,198,807,477,125,918,1678,472,329,964,255,411,1041,929,180,831,Male,New York


In [28]:
X.shape

(5769, 95)

In [29]:
categoric = X.select_dtypes('object').columns
categoric

Index(['Gender', 'SubjectsBirthLocation'], dtype='object')

In [30]:
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()

X[categoric]= enc.fit_transform(X[categoric])
X[categoric].head(3)

Unnamed: 0,Gender,SubjectsBirthLocation
0,1.0,91.0
1,1.0,86.0
2,1.0,86.0


## Train Test Split

In [31]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

print("X Train features shape: {}\ny Train features shape: {}\nX Test features shape : {}\nY Test features shape : {}".format
      (X_train.shape, y_train.shape, X_test.shape, y_test.shape))

X Train features shape: (4615, 95)
y Train features shape: (4615,)
X Test features shape : (1154, 95)
Y Test features shape : (1154,)
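
Unlike the split in the logistic section, this split omits `stratify`, which is why the test supports in the RF report below (733 / 282 / 139) differ from the stratified ones (758 / 260 / 136). The sketch below shows how `stratify` keeps the class proportions of the test set close to the overall ones. The label vector is rebuilt from the class counts printed earlier, and the single placeholder feature is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class counts taken from the filtered DODRace column (3792, 1298, 679)
y_demo = np.repeat([1, 2, 3], [3792, 1298, 679])
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

_, _, _, y_test_strat = train_test_split(X_demo, y_demo, test_size=0.2,
                                         stratify=y_demo, random_state=42)

labels, counts = np.unique(y_test_strat, return_counts=True)
overall = np.array([3792, 1298, 679]) / len(y_demo)  # proportions in the full data
test_share = counts / counts.sum()                    # proportions in the test set

print(dict(zip(labels.tolist(), counts.tolist())))
print(np.round(overall, 3), np.round(test_share, 3))
```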


In [32]:
X.isnull().sum().sum()

0

### Vanilla RF Model

In [36]:
from sklearn.ensemble import RandomForestClassifier

rf_default = RandomForestClassifier(random_state=42)

In [37]:
rf_default.fit(X_train,y_train)

RandomForestClassifier(random_state=42)

In [38]:
eval_metric(rf_default,X_train,y_train,X_test,y_test)

Test_Set
[[717  13   3]
 [ 73 207   2]
 [120  12   7]]
              precision    recall  f1-score   support

           1       0.79      0.98      0.87       733
           2       0.89      0.73      0.81       282
           3       0.58      0.05      0.09       139

    accuracy                           0.81      1154
   macro avg       0.75      0.59      0.59      1154
weighted avg       0.79      0.81      0.76      1154


Train_Set
[[3059    0    0]
 [   0 1016    0]
 [   0    0  540]]
              precision    recall  f1-score   support

           1       1.00      1.00      1.00      3059
           2       1.00      1.00      1.00      1016
           3       1.00      1.00      1.00       540

    accuracy                           1.00      4615
   macro avg       1.00      1.00      1.00      4615
weighted avg       1.00      1.00      1.00      4615
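
The test-set report above can be reproduced by hand from its confusion matrix, which makes the weak spot easier to see: class 3 is almost never recovered. The sketch below uses only the numbers printed in the Test_Set matrix; the helper name `per_class_metrics` is made up for this illustration.

```python
# Test-set confusion matrix copied from the report above
# (rows = true class, columns = predicted class).
cm = [
    [717, 13, 3],
    [73, 207, 2],
    [120, 12, 7],
]
labels = [1, 2, 3]


def per_class_metrics(cm):
    """Return (precision, recall, f1) lists with one entry per class."""
    n = len(cm)
    precision, recall, f1 = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precision.append(p)
        recall.append(r)
        f1.append(f)
    return precision, recall, f1


precision, recall, f1 = per_class_metrics(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total
macro_f1 = sum(f1) / len(f1)  # unweighted mean over classes

for lab, p, r, f in zip(labels, precision, recall, f1):
    print(f"class {lab}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
```

Accuracy looks acceptable because class 1 dominates the support, while the macro average weights the three classes equally and exposes the poor recall on class 3.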



In [51]:
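# NOTE: 'precision', 'recall' and 'f1' are binary scorers. On this 3-class target they fail
# inside each fold, so every column comes back as NaN (see the output below).
# Averaged names such as 'precision_macro', 'recall_macro' and 'f1_macro' are defined for multiclass.
scores = cross_validate(rf_default, X_train, y_train, scoring = ['accuracy', 'precision', 'recall', 'f1'], cv = 10)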
df_scores = pd.DataFrame(scores)

df_scores.mean()[2:]

test_accuracy    NaN
test_precision   NaN
test_recall      NaN
test_f1          NaN
dtype: float64
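
The NaN values above are what `cross_validate` records when a scorer fails, and the plain `'precision'`, `'recall'` and `'f1'` scorers are only defined for binary targets. A minimal sketch of the same call with macro-averaged scorer names follows. It runs on synthetic data from `make_classification` as a stand-in for `X_train` / `y_train` (an assumption), so the numbers it prints say nothing about the ANSUR II data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced 3-class data standing in for X_train / y_train (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, weights=[0.65, 0.25, 0.10], random_state=42,
)

# Scorer names that state the averaging strategy work for multiclass targets.
multiclass_scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]

scores = cross_validate(
    RandomForestClassifier(n_estimators=25, random_state=42),
    X_demo, y_demo, scoring=multiclass_scoring, cv=5,
)
df_scores = pd.DataFrame(scores)

# Select the test_* columns by name rather than by position.
cv_means = df_scores.filter(like="test_").mean()
print(cv_means)
```

Macro averaging treats the three classes equally; `'f1_weighted'` and its siblings weight each class by its support instead, which is the other common choice.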

### RF Model GridsearchCV

In [43]:
param_grid = {'n_estimators':[50, 64, 100, 128, 300],
             'max_features':[2, 3, 4, "sqrt"],   # "auto" was an alias of "sqrt" for classifiers and is rejected by scikit-learn >= 1.3
             'max_depth':[3, 5, 7, 9],
             'min_samples_split':[2, 5, 8]}

In [48]:
model = RandomForestClassifier(class_weight = "balanced", random_state=101)
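# NOTE: scoring="f1" is the binary F1 scorer. With three classes the scoring step fails for every
# candidate (see the traceback below), so the search cannot rank them. "f1_macro" is the multiclass option.
rf_grid_model = GridSearchCV(model, param_grid, scoring = "f1", n_jobs = -1, verbose =1).fit(X_train, y_train)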

Fitting 5 folds for each of 240 candidates, totalling 1200 fits


Traceback (most recent call last):
  File "/Users/veyselaytekin/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 761, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/Users/veyselaytekin/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 216, in __call__
    return self._score(
  File "/Users/veyselaytekin/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 264, in _score
    return self._sign * self._score_func(y_true, y_pred, **self._kwargs)
  File "/Users/veyselaytekin/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py", line 1123, in f1_score
    return fbeta_score(
  File "/Users/veyselaytekin/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py", line 1261, in fbeta_score
    _, _, f, _ = precision_recall_fscore_support(
  File "/Users/veyselaytekin/opt/anaconda3/lib/python3.9/site-packages/sklearn/metrics/_classification.py", 

In [49]:
rf_grid_model.best_params_

{'max_depth': 3, 'max_features': 2, 'min_samples_split': 2, 'n_estimators': 50}
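
The parameters reported above are simply the first combination in the grid (keys are iterated alphabetically), which is what one would expect if every candidate failed to score and the search had nothing to rank. A hedged sketch of a search that uses a multiclass scorer and stops on the first scoring error instead of silently recording NaN is shown below. The data is synthetic and the grid is deliberately tiny; both are assumptions made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for X_train / y_train (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=101,
)

# Reduced grid for illustration only.
small_grid = {"n_estimators": [20, 40], "max_depth": [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=101),
    small_grid,
    scoring="f1_macro",     # defined for multiclass targets
    error_score="raise",    # fail loudly instead of storing NaN scores
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

With `error_score="raise"` a wrong scorer name or an unsupported metric surfaces on the first fit rather than after all 1200 fits have run.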

In [50]:
eval_metric(rf_grid_model, X_train, y_train, X_test, y_test)

Test_Set
[[344  84 305]
 [ 19 188  75]
 [ 31  14  94]]
              precision    recall  f1-score   support

           1       0.87      0.47      0.61       733
           2       0.66      0.67      0.66       282
           3       0.20      0.68      0.31       139

    accuracy                           0.54      1154
   macro avg       0.58      0.60      0.53      1154
weighted avg       0.74      0.54      0.59      1154


Train_Set
[[1398  296 1365]
 [  68  716  232]
 [ 111   44  385]]
              precision    recall  f1-score   support

           1       0.89      0.46      0.60      3059
           2       0.68      0.70      0.69      1016
           3       0.19      0.71      0.31       540

    accuracy                           0.54      4615
   macro avg       0.59      0.62      0.53      4615
weighted avg       0.76      0.54      0.59      4615



## Build a new model with the best params

## 4. XGBoost

In [None]:
#start_time = time.time()
#xgb = XGBClassifier(n_estimators = 400, learning_rate = 0.1, max_depth = 3)
#xgb.fit(X_train.values, y_train)
#print('Fit time : ', time.time() - start_time)

In [None]:
#!pip uninstall xgboost 

#!pip install xgboost==0.90

Found existing installation: xgboost 1.6.2
Uninstalling xgboost-1.6.2:
  Would remove:
    /Users/veyselaytekin/opt/anaconda3/lib/python3.9/site-packages/xgboost-1.6.2.dist-info/*
    /Users/veyselaytekin/opt/anaconda3/lib/python3.9/site-packages/xgboost/*
Proceed (Y/n)? ^C
ERROR: Operation cancelled by user
ERROR: Could not find a version that satisfies the requirement xgboost==0.90 (from versions: none)
ERROR: No matching distribution found for xgboost==0.90


### Vanilla XGBoost Model

In [19]:
X = df.drop(['DODRace'],axis=1)
y = df['DODRace']

In [20]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42,stratify= y)

print("X Train features shape: {}\ny Train features shape: {}\nX Test features shape : {}\nY Test features shape : {}".format
      (X_train.shape, y_train.shape, X_test.shape, y_test.shape))

X Train features shape: (4854, 96)
y Train features shape: (4854,)
X Test features shape : (1214, 96)
Y Test features shape : (1214,)


In [31]:
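from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y_train = le.fit_transform(y_train)   # DODRace labels 1, 2, 3 -> 0, 1, 2, as XGBoost expects
y_test = le.transform(y_test)         # apply the same mapping to the test labels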

In [32]:
xgb_model = XGBClassifier(random_state=42)    # new default model

In [33]:
xgb_model.fit(X_train,y_train)

ValueError: DataFrame.dtypes for data must be int, float, bool or category.  When
categorical type is supplied, DMatrix parameter `enable_categorical` must
be set to `True`. Invalid columns:Gender, SubjectsBirthLocation, WritingPreference
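
The error above says that `Gender`, `SubjectsBirthLocation` and `WritingPreference` are still text columns in this version of `X_train`. One way to get past it is to encode them the same way the Random Forest section did, but fitting the encoder on the training split only. The sketch below uses a toy frame with made-up rows (only the three column names come from the error message) and maps categories unseen at fit time to -1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data with made-up values; the numeric column name is hypothetical.
X_train_demo = pd.DataFrame({
    "some_measurement": [1776, 1702, 1735],
    "Gender": ["Male", "Female", "Male"],
    "SubjectsBirthLocation": ["New York", "Texas", "Ohio"],
    "WritingPreference": ["Right hand", "Left hand", "Right hand"],
})
X_test_demo = pd.DataFrame({
    "some_measurement": [1655],
    "Gender": ["Female"],
    "SubjectsBirthLocation": ["Alaska"],   # not present in the training rows
    "WritingPreference": ["Right hand"],
})

# Every non-numeric column needs encoding before it reaches XGBoost.
cat_cols = list(X_train_demo.select_dtypes(exclude="number").columns)

enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

X_train_enc = X_train_demo.copy()
X_test_enc = X_test_demo.copy()
X_train_enc[cat_cols] = enc.fit_transform(X_train_demo[cat_cols])  # learn categories on train only
X_test_enc[cat_cols] = enc.transform(X_test_demo[cat_cols])        # reuse the same mapping

print(X_train_enc.dtypes)
print(X_test_enc)
```

Ordinal codes impose an arbitrary order on the categories. Tree-based models usually tolerate that, but it is a modelling choice rather than a requirement of XGBoost.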

In [None]:
eval_metric(xgb_model,X_train,y_train,X_test,y_test)

### Cross Validation

In [None]:
model = XGBClassifier(random_state=42)

# Multiclass target: use averaged scorer names ('precision', 'recall', 'f1' and 'roc_auc' are binary-only).
scores = cross_validate(model, X_train, y_train, scoring = ['accuracy', 'precision_macro', 'recall_macro',
                                                                   'f1_macro', 'roc_auc_ovr'], cv = 10)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

# We run cross-validation here; these are the train-set results, which we will compare with the test-set scores above.

In [147]:
from xgboost import XGBClassifier

### XGBoost Model GridsearchCV

In [140]:
param_grid = {"n_estimators":[50, 100, 200],'max_depth':[3,4,5], "learning_rate": [0.1, 0.2],
             "subsample":[0.5, 0.8, 1], "colsample_bytree":[0.5,0.7, 1]}

In [141]:
xgb_model = XGBClassifier(random_state=42)

In [142]:
# The target has no continuous values, so we do not use RMSE.
# This is not health data, so there is no need to optimise for recall.
# Since this is a classification task, 'accuracy' may be the better choice.

In [143]:
xgb_grid = GridSearchCV(xgb_model,
                        param_grid,
                        scoring='f1_macro',   # plain 'f1' is binary-only; the target has three classes
                        verbose=2,
                        n_jobs=-1)

In [144]:
xgb_grid.fit(X_train,y_train)

Fitting 5 folds for each of 162 candidates, totalling 810 fits
[CV] END colsample_bytree=0.5, learning_rate=0.1, max_depth=3, n_estimators=50, subsample=0.5; total time=   0.0s
[CV] END colsample_bytree=0.5, learning_rate=0.1, max_depth=3, n_estimators=50, subsample=0.5; total time=   0.0s
[CV] END colsample_bytree=0.5, learning_rate=0.1, max_depth=3, n_estimators=50, subsample=0.5; total time=   0.0s
[CV] END colsample_bytree=0.5, learning_rate=0.1, max_depth=3, n_estimators=50, subsample=0.5; total time=   0.0s
[CV] END colsample_bytree=0.5, learning_rate=0.1, max_depth=3, n_estimators=50, subsample=0.8; total time=   0.0s
[CV] END colsample_bytree=0.5, learning_rate=0.1, max_depth=3, n_estimators=50, subsample=0.5; total time=   0.0s
[CV] END colsample_bytree=0.5, learning_rate=0.1, max_depth=3, n_estimators=50, subsample=0.8; total time=   0.0s
[CV] END colsample_bytree=0.5, learning_rate=0.1, max_depth=3, n_estimators=50, subsample=0.8; total time=   0.0s
[CV] END colsample_bytree

ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1 2], got [1 2 3]
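
The error above means the classifier received the raw labels 1, 2, 3 while it expects consecutive integers starting at 0. A small sketch of the usual remedy with `LabelEncoder` follows; the label arrays are made-up stand-ins for `y_train` and `y_test`, and `fake_predictions` is a hypothetical model output used only to show the reverse mapping.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels in the original 1/2/3 coding.
y_train_demo = np.array([1, 2, 3, 1, 1, 2])
y_test_demo = np.array([3, 1, 2])

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train_demo)  # learn the mapping 1, 2, 3 -> 0, 1, 2
y_test_enc = le.transform(y_test_demo)        # apply the same mapping to the test labels

# A model trained on encoded labels predicts encoded labels;
# inverse_transform restores the original coding for reporting.
fake_predictions = np.array([0, 2, 1])        # hypothetical model output
original_labels = le.inverse_transform(fake_predictions)

print(le.classes_, y_train_enc, y_test_enc, original_labels)
```

The encoding has to be applied to both splits before any fit, grid search or evaluation call, otherwise train and test labels end up in different codings.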

In [None]:
xgb_grid.best_params_

## Create a new model with hyperparameters

In [None]:
xgb_model = XGBClassifier()    # the tuned hyperparameters go here

In [None]:
xgb_model.fit(X_train,y_train)

### Scoring

In [None]:
eval_metric(xgb_model,X_train,y_train,X_test,y_test)

## Cross Validation

In [None]:
# Multiclass target: use averaged scorer names ('precision', 'recall', 'f1' and 'roc_auc' are binary-only).
scores = cross_validate(xgb_model, X_train, y_train, scoring = ['accuracy', 'precision_macro', 'recall_macro',
                                                                   'f1_macro', 'roc_auc_ovr'], cv = 10)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

# We run cross-validation here; these are the train-set results, which we will compare with the test-set scores above.

# Feature Importance

In [None]:
xgb_model.feature_importances_

In [None]:
xgb_features_importance= pd.DataFrame(index=X.columns,data=xgb_model.feature_importances_,columns=['xgb_model_features_importance'])
xgb_features_importance_sort = xgb_features_importance.sort_values(by='xgb_model_features_importance')
xgb_features_importance_sort
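
With more than 90 features, the full importance table is hard to read, so it may help to look only at the top few. The sketch below shows one way to do that with a pandas Series. It uses a `RandomForestClassifier` on synthetic data as a stand-in for the fitted `xgb_model` (an assumption; both expose `feature_importances_`), and the `feat_i` column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names, standing in for X / y.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0,
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(X_arr.shape[1])])

# Stand-in for the fitted xgb_model.
model = RandomForestClassifier(n_estimators=30, random_state=0).fit(X_demo, y_demo)

# Pair each importance with its column name and sort from most to least important.
importances = pd.Series(
    model.feature_importances_, index=X_demo.columns, name="importance",
).sort_values(ascending=False)

top5 = importances.head(5)
print(top5)
```

The index must come from the same frame, with the same column order, that the model was trained on; otherwise the names and importances will be misaligned.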

In [None]:
# After this step you can, if you wish, continue with the most suitable parameters

## Comparing Models

# Before the Deployment
- Choose the model that works best based on your chosen metric.
- As a final step, fit the best model on the whole dataset to get better performance.
- Your model is then ready to deploy; dump your model and scaler.

- Evaluation metrics 
https://towardsdatascience.com/comprehensive-guide-on-multiclass-classification-metrics-af94cfb83fbd
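
The "dump your model and scaler" step can be sketched with `joblib`. Wrapping the scaler and the classifier in one scikit-learn pipeline means a single file holds both. Everything below is illustrative: the data is synthetic, `final_model` and the file name are made-up names, and a Random Forest stands in for whichever model is finally chosen.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the whole dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=42,
)

# Scaler and model travel together, so they cannot get out of sync at prediction time.
final_model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=30, random_state=42),
)
final_model.fit(X_demo, y_demo)   # final fit on all available rows

# Persist to disk, then reload as a deployment script would.
model_path = os.path.join(tempfile.mkdtemp(), "final_model.joblib")
joblib.dump(final_model, model_path)
reloaded = joblib.load(model_path)

same_predictions = bool((reloaded.predict(X_demo) == final_model.predict(X_demo)).all())
print(model_path, same_predictions)
```

A file written this way should be loaded with the same scikit-learn version that produced it, so it is worth recording the package versions next to the file.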

## The topics listed below should be worked through with domain knowledge to reach a solution

# SMOTE
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

The article above explains the SMOTE procedure in detail.

## SMOTE implementation

__SMOTE__ -----> In imbalanced datasets, SMOTE raises the number of observations in every other class to match the class with the most observations. For example, the largest class in our data was White with 3034 observations. After SMOTE is applied, the other classes contain the same number of observations. It should always be tried on imbalanced datasets:
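
To make the idea concrete, here is a simplified SMOTE-style oversampler written with NumPy only. It is a teaching sketch and not the library implementation: in practice `imblearn.over_sampling.SMOTE` from the imbalanced-learn package would normally be used. The function name `smote_like_oversample` and the toy data are made up for this illustration, and the sketch assumes every class has at least two samples.

```python
import numpy as np


def smote_like_oversample(X, y, k=3, seed=42):
    """Grow every minority class to the majority count by interpolating between
    a sample and one of its k nearest neighbours from the same class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        n_new = target - count
        if n_new == 0:
            continue  # majority class: nothing to add
        X_cls = X[y == cls]
        # Pairwise distances inside the class; a point is never its own neighbour.
        dists = np.linalg.norm(X_cls[:, None, :] - X_cls[None, :, :], axis=2)
        np.fill_diagonal(dists, np.inf)
        k_eff = min(k, len(X_cls) - 1)
        neighbours = np.argsort(dists, axis=1)[:, :k_eff]
        base = rng.integers(0, len(X_cls), size=n_new)              # seed samples
        picked = neighbours[base, rng.integers(0, k_eff, size=n_new)]
        gap = rng.random((n_new, 1))                                # position on the segment
        synthetic = X_cls[base] + gap * (X_cls[picked] - X_cls[base])
        X_parts.append(synthetic)
        y_parts.append(np.full(n_new, cls))
    return np.vstack(X_parts), np.concatenate(y_parts)


# Toy imbalanced data: 40 / 15 / 5 rows for classes 1 / 2 / 3.
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(60, 4))
y_demo = np.array([1] * 40 + [2] * 15 + [3] * 5)

X_res, y_res = smote_like_oversample(X_demo, y_demo)
class_counts = {int(c): int(n) for c, n in zip(*np.unique(y_res, return_counts=True))}
print(class_counts)
```

Oversampling belongs on the training split only. Resampling before the train/test split would leak synthetic copies of test-like points into training and inflate the scores.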

## Logistic Regression Over/Under Sampling

#  SHAP

https://towardsdatascience.com/shap-explain-any-machine-learning-model-in-python-24207127cad7

## Shap values for all data

## SMOTE for X3 dataset

## Find the best threshold for multiclass classification
