## Machine Learning Project 2022/2023 - Group 60

Group members:<p>
    Beatriz Carmo - 20220685 <p>
    João Malho - 20220696 <p>
    Lizaveta Baryionak - 20220667 <p>
    Marta Antunes - 20221094 <p>
    Tomás Silva - 20221639

In [1]:
%autosave 90

#basic libraries: numpy and pandas for data handling, pyplot
#and seaborn for visualization, math for mathematical operations
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
from scipy.stats import chi2_contingency


#dataset partition
from sklearn.model_selection import train_test_split

#feature selection methods
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import RFE
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

#scaling methods and categorical variable encoder
from sklearn.preprocessing import RobustScaler, OneHotEncoder

#model selection 
from sklearn import model_selection
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV

#undersampling methods
#from imblearn.under_sampling import CondensedNearestNeighbour

import warnings
warnings.filterwarnings('ignore')

#ensemble classifier models
#(the experimental import is only required on scikit-learn versions before 1.0)
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

#model evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, \
make_scorer, classification_report, confusion_matrix, f1_score

from itertools import combinations
from collections import Counter

Autosaving every 90 seconds


### Loading Data (and joining it into one DataFrame)

In [2]:
#training data

demo_original=pd.read_excel('train_demo.xlsx')
habits_original=pd.read_excel('train_habits.xlsx')
health_original=pd.read_excel('train_health.xlsx')


In [3]:
#join all the training data into one dataframe
#(set_index() without assignment or inplace=True returns a new frame and leaves
#the original untouched, so the tables are merged directly on the PatientID column)

original=pd.merge(pd.merge(demo_original,habits_original,on='PatientID'),
                  health_original,on='PatientID')
original.set_index('PatientID', inplace=True)
df=original.copy()
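
`pd.merge` defaults to an inner join, so a patient missing from any of the three tables would be dropped without warning. A minimal sketch of how the join could be checked, using made-up toy frames (all values below are hypothetical, and `validate='one_to_one'` is an optional safeguard, not something the original notebook uses):

```python
import pandas as pd

# Hypothetical toy tables standing in for the demo / habits / health sheets
demo = pd.DataFrame({'PatientID': [1, 2, 3],
                     'Region': ['London', 'South West', 'London']})
habits = pd.DataFrame({'PatientID': [1, 2, 3],
                       'Exercise': ['Yes', 'No', 'Yes']})
health = pd.DataFrame({'PatientID': [1, 2, 3],
                       'Height': [155, 173, 162]})

# validate='one_to_one' raises MergeError if a PatientID repeats in either table
merged = (demo.merge(habits, on='PatientID', validate='one_to_one')
              .merge(health, on='PatientID', validate='one_to_one')
              .set_index('PatientID'))

# An inner join keeps only IDs present in every table, so compare row counts
rows_lost = len(demo) - len(merged)
print(merged.shape, rows_lost)
```

If `rows_lost` is not zero, some `PatientID` values are absent from at least one table.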

In [4]:
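#testing data
#NOTE: the original cell re-read the train_*.xlsx files; the test_*.xlsx names
#below are an assumption that mirrors the training file names

demo_test=pd.read_excel('test_demo.xlsx')
habits_test=pd.read_excel('test_habits.xlsx')
health_test=pd.read_excel('test_health.xlsx')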


In [5]:
#joining the testing data into one dataframe
#(merged directly on PatientID; set_index() without assignment would have no effect)

test_df_orig=pd.merge(pd.merge(demo_test,habits_test,on='PatientID'),
                  health_test,on='PatientID')
test_df_orig.set_index('PatientID', inplace=True)
test_df=test_df_orig.copy()

In [6]:
#inspecting the merged training dataframe
df

PatientID,Name,Birth_Year,Region,Education,Disease,Smoking_Habit,Drinking_Habit,Exercise,Fruit_Habit,Water_Habit,Height,Weight,High_Cholesterol,Blood_Pressure,Mental_Health,Physical_Health,Checkup,Diabetes
1167,Mrs. Stephanie Gay,1965,London,High School Incomplete (10th to 11th grade),1,No,I usually consume alcohol every day,Yes,Less than 1. I do not consume fruits every day.,Between one liter and two liters,155,67,358,120,21,2,More than 3 years,Neither I nor my immediate family have diabetes.
1805,Mr. Sherman Nero,1969,South West,High School Incomplete (10th to 11th grade),1,No,I consider myself a social drinker,Yes,Less than 1. I do not consume fruits every day.,Between one liter and two liters,173,88,230,142,9,0,Not sure,Neither I nor my immediate family have diabetes.
1557,Mr. Mark Boller,1974,Yorkshire and the Humber,Elementary School (1st to 9th grade),1,No,I consider myself a social drinker,No,Less than 1. I do not consume fruits every day.,More than half a liter but less than one liter,162,68,226,122,26,0,More than 3 years,Neither I nor my immediate family have diabetes.
1658,Mr. David Caffee,1958,London,University Complete (3 or more years),0,No,I usually consume alcohol every day,Yes,Less than 1. I do not consume fruits every day.,More than half a liter but less than one liter,180,66,313,125,13,8,Not sure,I have/had pregnancy diabetes or borderline di...
1544,Mr. Gerald Emery,1968,South East,University Incomplete (1 to 2 years),1,No,I consider myself a social drinker,No,1 to 2 pieces of fruit in average,More than half a liter but less than one liter,180,58,277,125,18,2,More than 3 years,I have/had pregnancy diabetes or borderline di...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1909,Mr. Philip Klink,1972,East Midlands,High School Incomplete (10th to 11th grade),0,No,I consider myself a social drinker,Yes,Less than 1. I do not consume fruits every day.,Between one liter and two liters,178,61,204,144,12,4,Not sure,Neither I nor my immediate family have diabetes.
1386,Mrs. Jackie Valencia,1980,North West,Elementary School (1st to 9th grade),1,No,I usually consume alcohol every day,No,Less than 1. I do not consume fruits every day.,Between one liter and two liters,157,61,213,120,23,0,More than 3 years,I have/had pregnancy diabetes or borderline di...
1088,Mrs. Cheryl Harris,1860,East Midlands,Elementary School (1st to 9th grade),0,No,I consider myself a social drinker,No,3 to 4 pieces of fruit in average,More than half a liter but less than one liter,167,48,272,140,20,17,More than 3 years,Neither I nor my immediate family have diabetes.
1662,Mr. Florencio Doherty,1975,East of England,Elementary School (1st to 9th grade),1,No,I usually consume alcohol every day,No,Less than 1. I do not consume fruits every day.,More than half a liter but less than one liter,165,75,208,112,16,0,More than 3 years,Neither I nor my immediate family have diabetes.


In [7]:
df['Name'].duplicated()

PatientID
1167    False
1805    False
1557    False
1658    False
1544    False
        ...  
1909    False
1386    False
1088    False
1662    False
1117    False
Name: Name, Length: 800, dtype: bool
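
The truncated boolean output above does not show whether any name repeats. A small sketch, on a hypothetical toy Series, of how the duplicates could be counted and listed:

```python
import pandas as pd

# Hypothetical names; in the notebook this would be df['Name']
names = pd.Series(['Mr. A', 'Mrs. B', 'Mr. A', 'Mr. C'], name='Name')

# duplicated() marks every repeat after the first occurrence
n_repeats = int(names.duplicated().sum())

# keep=False marks all rows that share a name, including the first one
all_copies = names[names.duplicated(keep=False)]
print(n_repeats)
print(all_copies)
```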

In [10]:
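#the index holds PatientID, so matching a name against it returns an empty frame;
#to look a patient up by name, filter on the column instead:
#df.loc[df['Name'] == 'Mr. Gary Miller']
df.loc[df.index == 'Mr. Gary Miller']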

PatientID,Name,Birth_Year,Region,Education,Disease,Smoking_Habit,Drinking_Habit,Exercise,Fruit_Habit,Water_Habit,Height,Weight,High_Cholesterol,Blood_Pressure,Mental_Health,Physical_Health,Checkup,Diabetes
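
The result above is empty because the frame is indexed by `PatientID`, so no index value can equal a name. A short sketch on a hypothetical toy frame (the second name and all IDs are made up) contrasting the two filters:

```python
import pandas as pd

# Hypothetical toy frame indexed by PatientID, like df in the notebook
toy = pd.DataFrame(
    {'Name': ['Mr. Gary Miller', 'Mrs. Ann Lee', 'Mr. Gary Miller']},
    index=pd.Index([1001, 1002, 1003], name='PatientID'),
)

# Comparing the integer index with a string matches nothing
by_index = toy.loc[toy.index == 'Mr. Gary Miller']

# Filtering on the Name column returns every row with that name
by_name = toy.loc[toy['Name'] == 'Mr. Gary Miller']
print(len(by_index), list(by_name.index))
```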


In [None]:
df.sort_values(by = 'PatientID')

In [None]:
#df['Region'] = df['Region'].str.lower()
#df

In [None]:
df['Region'].unique()

In [None]:
df['Drinking_Habit'].unique()
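
Inspecting `unique()` is useful for spotting the same category written in different ways. A minimal sketch, assuming the inconsistencies are limited to letter case and stray spaces (the values below are hypothetical):

```python
import pandas as pd

# Hypothetical region values with inconsistent case and whitespace
regions = pd.Series(['London', 'LONDON', ' london', 'South West'])

# Strip surrounding spaces, then lower-case, so the variants collapse
cleaned = regions.str.strip().str.lower()
print(sorted(cleaned.unique()))
```

In the notebook the same chain could be assigned back to a single column, for example `df['Region'] = df['Region'].str.strip().str.lower()`, which leaves the rest of the frame intact.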

### Initial Data Exploration

**Information about the Data fields of our DataFrame**

PatientID - The unique identifier of the patient <p>
Birth Year - Patient Year of Birth <p>
Name - Name of the patient <p>
Region - Patient Living Region <p>
Education - Answer to the question: What is the highest grade or year of school you have? <p>
Disease - The dependent variable. If the patient has the disease (Disease = 1) or not (Disease = 0) <p>
Height - Patient’s height <p>
Weight - Patient’s weight <p>
Checkup - Answer to the question: How long has it been since you last visited a doctor for a routine Checkup? (A routine Checkup is a general physical exam, not an exam for a specific injury, illness, or condition.) <p>
Diabetes - Answer to the question: (Ever told) you or your direct relatives have diabetes? <p>
High Cholesterol - Cholesterol value <p>
Blood Pressure - Blood pressure value at rest <p>
Mental Health - Answer to the question: During the past 30 days, for about how many days did poor physical or mental health keep you from doing your usual activities, such as self-care, work, or recreation? <p>
Physical Health - Answer to the question: Thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good to the point where it was difficult to walk? <p>
Smoking Habit - Answer to the question: Do you smoke more than 10 cigars daily? <p>
Drinking Habit - Answer to the question: What is your behavior concerning alcohol consumption? <p>
Exercise - Answer to the question: Do you exercise (more than 30 minutes) 3 times per week or more? <p>
Fruit Habit - Answer to the question: How many portions of fruits do you consume per day? <p>
Water Habit - Answer to the question: How much water do you drink per day?

In [None]:
#delete duplicates
#make text values consistent (lower case)
#check the F1 score (F-score)
#look at Exercise, Disease, Smoking_Habit
#Birth_Year --> age
#make intervals for ages
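
The last two notes (age from birth year, age intervals) could be sketched as follows. The reference year, the bin edges, and the labels are all assumptions chosen for illustration; the birth years are taken from the sample rows shown earlier.

```python
import pandas as pd

REFERENCE_YEAR = 2022  # assumption: the year the project data refers to

# Birth years that appear in the sample output, including the suspicious 1860
toy = pd.DataFrame({'Birth_Year': [1965, 1980, 1958, 1860]})
toy['Age'] = REFERENCE_YEAR - toy['Birth_Year']

# Hypothetical age intervals; right=False makes each bin [lower, upper)
bins = [0, 40, 50, 60, 120]
labels = ['<40', '40-49', '50-59', '60+']
toy['Age_Group'] = pd.cut(toy['Age'], bins=bins, labels=labels, right=False)
print(toy)
```

Ages outside the last bin edge fall in no interval and become `NaN`, which also flags implausible birth years such as 1860 for review.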