<p><h1><b>World Happiness Report</b></h1></p>
<p><h2><b>Content</b></h2></p>
<ul>
    <a href='#1'><li>Introduction</li></a>
    <a href='#2'><li>Import Library</li></a>
    <a href='#3'><li>Exploratory Data Analysis</li></a>
    <a href='#4'><li>Data Visualization</li></a>
     <a href='#5'><li>Data Cleaning</li></a>
    <a href='#6'><li>Machine Learning</li></a>
        <ul>
                <a href='#7'><li>Logistic Regression</li></a>
            <a href='#8'><li>K-Nearest Neighbors</li></a>
            <a href='#9'><li>Naive Bayes</li></a>
            <a href='#10'><li>Gradient Boosting Machine</li></a>
            <a href='#11'><li>Random Forest</li></a>
            <a href='#12'><li>Decision Tree</li></a>
            <a href='#13'><li>Kernelized SVM</li></a>
        </ul>
</ul>

<p>last updated : <b>30.06.2019</b></p>

<p id='1'><b><h3>Introduction</h3></b></p>
<p>The World Happiness Report is a landmark survey of the state of global happiness. The first report was published in 2012, the second in 2013, the third in 2015, and the fourth in the 2016 Update. The World Happiness 2017, which ranks 155 countries by their happiness levels, was released at the United Nations at an event celebrating International Day of Happiness on March 20th. The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields – economics, psychology, survey analysis, national statistics, health, public policy and more – describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness.</p>

<p>Our aim here is to analyze the data set in detail and visualize it with a wide range of visualization tools.</p>

<p id='2'><b><h3>Import Library</h3></b></p>

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from datetime import datetime

from sklearn.preprocessing import StandardScaler
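# Imputer was removed from sklearn.preprocessing in newer scikit-learn releases; SimpleImputer replaces it.
from sklearn.impute import SimpleImputer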
from sklearn.model_selection import GridSearchCV,train_test_split,cross_val_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc
import os
import warnings
warnings.filterwarnings('ignore')
print(os.listdir("../input"))


data=pd.read_csv('../input/2015.csv')

<p id='3'><b><h3>Exploratory Data Analysis</h3></b></p>

In [None]:
print("First 5 rows:\n",(data.head()))

print("Last 5 rows:\n",(data.tail()))

In [None]:
data.sample(5)

In [None]:
# data.info() prints its summary itself and returns None, so it is called without print().
data.info()

In [None]:
print("Data Describe:\n",(data.describe()))

In [None]:
print('Country Counts unique')
countries=data.Country.unique()
for country in countries:
    print(country)

In [None]:
print("Country Counts :\n")
print(data['Country'].value_counts())

In [None]:
data.dtypes

In [None]:
pd.isnull(data).sum()

In [None]:
# Drop every row that contains at least one null value.
# dropna returns a new DataFrame, so the result is assigned back to data.
data=data.dropna(how='any',axis='rows')
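
<p>A minimal sketch of how <code>dropna</code> behaves, using a made-up three-row frame (the name <code>toy</code> and its values are assumptions for illustration only). It shows that the call returns a new DataFrame and leaves the original untouched unless the result is assigned.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names only mimic the notebook's data.
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Freedom": [0.6, np.nan, 0.4],
})

# dropna returns a new DataFrame; toy itself is not modified.
cleaned = toy.dropna(how="any", axis="rows")

print(len(toy))      # 3: the original still holds every row
print(len(cleaned))  # 2: the row with the missing Freedom value is gone
```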

In [None]:
#show random rows in dataset
data.sample(5)

In [None]:
print('Unique regions:\n')
regions=data.Region.unique()
for reg in regions:
    print(reg)

In [None]:
# sum() counts the missing Region values; count() would only return the number of rows.
pd.isna(data['Region']).sum()
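
<p>A small sketch, on a hypothetical four-element Series, of the difference between counting the entries of a null mask and counting the missing values it flags. The names <code>region</code> and <code>mask</code> are illustrative assumptions.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing entries.
region = pd.Series(["Western Europe", np.nan, "Eastern Asia", np.nan])

mask = pd.isna(region)         # boolean mask: True where a value is missing

n_entries = int(mask.count())  # number of entries in the mask
n_missing = int(mask.sum())    # number of True values, i.e. missing entries

print(n_entries)  # 4
print(n_missing)  # 2
```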

In [None]:
for col in data.columns:
    print(data[data[col].isnull()])

In [None]:
print('Regions Counts:\n')
print(data['Region'].value_counts())

In [None]:
data.isnull().values.any()

<p id='4'><h3><b>Data Visualization</b></h3></p>

In [None]:
data_filter=data.iloc[:,1:9]
# Scatter each selected column against the row position to spot missing or unusual values.
plt.figure(figsize=(10, 10))
for i, col in enumerate(data_filter.columns.values):
    plt.subplot(4, 2, i+1)
    plt.scatter(np.arange(1, len(data)+1), data[col].values.tolist())
    plt.title(col)
plt.tight_layout()
plt.show()

In [None]:
data.head()

In [None]:
data_region=data['Region'].value_counts()
data_rvalues=data_region.values
data_rregion=data_region.index

In [None]:

plt.figure(figsize=(10,10))
sns.barplot(x=data_rregion,y=data_rvalues)
plt.xticks(rotation=90)
plt.xlabel('Region')
plt.ylabel('Values')
plt.title('Region VS Values')
plt.show()

In [None]:
sns.countplot(x=data.Region)
plt.title('Region Count System')
plt.show()

In [None]:
sns.swarmplot(x=data.Region)
plt.show()

In [None]:
data.head()

In [None]:
plt.figure(figsize=(10,10))
ax=sns.barplot(x=data_rregion,y=data_rvalues,palette=sns.cubehelix_palette(len(data_rregion)))
plt.xlabel('Regions')
plt.ylabel('Values')
plt.xticks(rotation=90)
plt.title('Most Common Region of Happy')
plt.show()

In [None]:
# Draw a violinplot with a narrower bandwidth than the default
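# Correlations are computed on the numeric columns only; ax is kept so the limits below apply to this plot.
ax = sns.violinplot(data=data.select_dtypes(include='number').corr(), palette="Set3", bw=.2, cut=1, linewidth=1)

# Finalize the figure
ax.set(ylim=(-.7, 1.05))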
sns.despine(left=True, bottom=True)
plt.xticks(rotation=90)
plt.show()

In [None]:
data.columns

In [None]:
plt.figure(figsize=(10,10))
plt.scatter(data['Economy (GDP per Capita)'], data['Freedom'], s=(data['Happiness Score']**3), alpha=0.5)
plt.grid(True)

plt.xlabel("Economy")
plt.ylabel("Freedom")

plt.suptitle("Economy vs Freedom, with marker size scaled by Happiness Score", fontsize=18)

plt.show()

In [None]:
data.columns=['Country','Region','Happiness_Rank','Happiness_Score','Standart_Error','Economy_GPD_Capital','Family','Healt_Life_Expectancy','Freedom','Trust_Goverment_Corruption','Generosity','Dystopia_Residual']

In [None]:
data['Happiness_Rank'].unique()

In [None]:
values_region=data.Region.value_counts().values
values_region

In [None]:
Economy_GPD_Capital=[]
for i,region in enumerate(data.Region.value_counts().index):
    Economy_GPD_Capital.append(sum(data[data['Region']==region].Economy_GPD_Capital)/values_region[i])

In [None]:
Economy_GPD_Capital
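
<p>The per-region average computed by the loop above can also be written as a <code>groupby</code>. This is a sketch on made-up numbers (the frame <code>toy</code> and its region names are assumptions), not a replacement for the notebook's cell.</p>

```python
import pandas as pd

# Hypothetical data: two regions with known averages.
toy = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "South"],
    "Economy_GPD_Capital": [1.0, 2.0, 0.5, 1.0, 1.5],
})

# One mean per region; the region label stays attached as the index.
mean_by_region = toy.groupby("Region")["Economy_GPD_Capital"].mean()

print(mean_by_region["North"])  # 1.5
print(mean_by_region["South"])  # 1.0
```

<p>Because the result is a Series indexed by region, values and labels cannot drift apart when they are passed to a plot.</p>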

In [None]:
data.head()

In [None]:
cmap = sns.cubehelix_palette(rot=-.2, as_cmap=True)
ax = sns.scatterplot(x="Happiness_Score", y="Standart_Error",
                     palette=cmap, sizes=(10, 200),
                     data=data)
plt.show()

In [None]:
# Pie Chart
colors = ['grey','blue','red','yellow','green','brown','lime','pink','orange','purple']
explode = [0,0.4,0,0.1,0,0,0,0,0,0.2]
plt.figure(figsize = (7,7))
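# The means were computed in value_counts() order, so the labels must use the same order.
plt.pie(Economy_GPD_Capital, explode=explode, labels=data.Region.value_counts().index, colors=colors, autopct='%1.1f%%')
plt.title('Share of Mean Economy GDP by Region',color = 'blue',fontsize = 15)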
plt.show()

In [None]:
ayrimcol=[]

for col in data.columns:
    print(col)
    ayrimcol.append(col.split())

In [None]:
data.columns=['Country','Region','Happiness_Rank','Happiness_Score','Standart_Error','Economy_GPD_Capital','Family','Healt_Life_Expectancy','Freedom','Trust_Goverment_Corruption','Generosity','Dystopia_Residual']

happiess_ranks=[]
happiess_scores=[]
print('----------------------------------------------------------------------')
for i in data.Region.unique():
    happiess_ranks.append(sum(data[data['Region']==i].Happiness_Rank))
    happiess_scores.append(sum(data[data['Region']==i].Happiness_Score))

f,ax=plt.subplots(figsize=(10,10))
# The sums were built in data.Region.unique() order, so the same order is used for the labels.
sns.barplot(y=data.Region.unique(),x=happiess_ranks,color='green',alpha=0.5,label='Ranks')
sns.barplot(y=data.Region.unique(),x=happiess_scores,color='red',alpha=0.7,label='Scores')
ax.legend(loc='lower right',frameon=True)
ax.set(xlabel='Sum over Countries',ylabel='Region',title='Summed Happiness Ranks VS Scores by Region')
plt.xticks(rotation=90)
plt.show()
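
<p>Several cells pair per-region totals with a list of region labels. A minimal sketch, on made-up data, of why the label order matters: <code>unique()</code> returns regions in order of first appearance, while <code>value_counts().index</code> returns them by frequency. The frame <code>toy</code> is a hypothetical example.</p>

```python
import pandas as pd

# Hypothetical data: North appears most often, but South appears first.
toy = pd.DataFrame({
    "Region": ["South", "North", "North", "East", "North", "East"],
    "Happiness_Score": [3.0, 7.0, 6.5, 5.0, 7.5, 5.5],
})

by_appearance = list(toy["Region"].unique())
by_frequency = list(toy["Region"].value_counts().index)

print(by_appearance)  # ['South', 'North', 'East']
print(by_frequency)   # ['North', 'East', 'South']

# A groupby keeps each total attached to its own region label.
totals = toy.groupby("Region")["Happiness_Score"].sum()
print(totals["North"])  # 21.0
```

<p>Plotting <code>totals.index</code> against <code>totals.values</code> is one way to keep labels and values aligned.</p>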

In [None]:
data[data['Freedom']==0]
# Rows where Freedom equals 0, i.e. the country (or countries) with the lowest freedom score.

In [None]:
data[data['Freedom']>0.66]
# Countries with a Freedom score above 0.66. Switzerland seems to be among the best on both health and freedom.

In [None]:
# Mean trust in government (the corruption indicator) for each region.
data.groupby('Region')['Trust_Goverment_Corruption'].mean()

In [None]:
pal=sns.cubehelix_palette(2,rot=.5,dark=.3)
sns.violinplot(data=data.groupby('Region')['Trust_Goverment_Corruption'].count().values, palette=pal, inner="points",color='b')
#sns.violinplot(data=data.groupby('Region')['Freedom'].count().values, palette=pal, inner="points",color='r')
plt.title('Region State for Trust_Goverment_Corruption ')
plt.xlabel('Region')
plt.ylabel('Frequency')
plt.show()

In [None]:
sns.pairplot(data)
plt.show()

In [None]:
g = sns.PairGrid(data, diag_sharey=False)
g.map_lower(sns.kdeplot)
g.map_upper(sns.scatterplot)
g.map_diag(sns.kdeplot, lw=3)
plt.show()

In [None]:
data.head()

In [None]:
sns.set(style="whitegrid")
ax = sns.boxplot(x=data["Trust_Goverment_Corruption"])
plt.show()

In [None]:
ax = sns.boxplot(y="Happiness_Score", x="Economy_GPD_Capital",data=data, linewidth=2.5)
plt.title('Happiness Score vs Economy GDP Capital')
plt.show()

In [None]:
sns.set(style="whitegrid")
ax = sns.boxplot(x=data["Happiness_Score"])
plt.show()

In [None]:
# Pie Chart
colors = ['grey','blue','red','yellow','green','brown','lime','pink','orange','purple']
explode = [0,0,0,0,0,0,0,0,0,0]
plt.figure(figsize = (7,7))
plt.pie(data.groupby('Region')['Trust_Goverment_Corruption'].mean().values, explode=explode, labels=data.groupby('Region')['Trust_Goverment_Corruption'].mean().index, colors=colors, autopct='%1.1f%%')
plt.title('Region vs Trust in Government (Corruption)',color = 'blue',fontsize = 15)
plt.show()

In [None]:
# ascending must be a boolean; False puts the most generous countries first.
generosity=data.sort_values(by="Generosity",ascending=False)[:20].reset_index(drop=True)

In [None]:
# The 20 countries with the highest Generosity values.
generosity
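
<p>A short sketch of selecting the top rows by a column, on a hypothetical five-country frame. It assumes the goal is the highest Generosity values and shows two equivalent ways to get them: a descending sort followed by <code>head</code>, and <code>nlargest</code>.</p>

```python
import pandas as pd

# Hypothetical data with distinct Generosity values.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Generosity": [0.10, 0.45, 0.30, 0.05, 0.20],
})

# Descending sort, then keep the first two rows.
top2_sorted = (
    toy.sort_values(by="Generosity", ascending=False)
       .head(2)
       .reset_index(drop=True)
)

# nlargest expresses the same selection in one call.
top2_nlargest = toy.nlargest(2, "Generosity").reset_index(drop=True)

print(top2_sorted["Country"].tolist())    # ['B', 'C']
print(top2_nlargest["Country"].tolist())  # ['B', 'C']
```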

In [None]:
sns.barplot(x=generosity.Region.value_counts().index,y=generosity.Region.value_counts().values)
plt.xlabel("Region")
plt.ylabel("Count")
plt.xticks(rotation=90)
plt.title("Regions of the 20 most generous countries")
plt.show()

In [None]:
f,ax1=plt.subplots(figsize=(10,10))
# happiess_ranks and happiess_scores follow data.Region.unique() order, so the x labels use it too.
sns.pointplot(x=data.Region.unique(),y=happiess_ranks,color='lime',alpha=0.8)
sns.pointplot(x=data.Region.unique(),y=happiess_scores,color='red',alpha=0.8)
plt.xlabel('Region',fontsize=15,color='blue')
plt.ylabel('Rates',fontsize=15,color='blue')
plt.title('Region VS Rank and Scores Rates',fontsize=20,color='blue')
plt.xticks(rotation=90)
plt.grid()
plt.show()

In [None]:
colors = ['grey','blue','red','yellow','green','brown','lime','pink','orange','purple']
explode = [0,0,0,0,0,0,0,0,0,0]
plt.figure(figsize = (7,7))
plt.pie(data_rvalues, explode=explode, labels=data_rregion, colors=colors, autopct='%1.1f%%')
plt.title('Regions Rates',color = 'blue',fontsize = 15)
plt.show()

In [None]:
print(data_rvalues)
print(len(data_rregion))

In [None]:
turst_goverment_corr=[]

for col in data.Region.unique():
    turst_goverment_corr.append(sum(data[data['Region']==col].Trust_Goverment_Corruption))
    

plt.figure(figsize=(10,10))
# The sums follow data.Region.unique() order, so the labels use the same order.
sns.barplot(x=data.Region.unique(),y=turst_goverment_corr)
plt.xticks(rotation=90)
plt.xlabel('Regions')
plt.ylabel('Trust in Government (Corruption)')
plt.title('Corruption Rates')
plt.show()

In [None]:
plt.figure(figsize=(7,7))
plt.pie(turst_goverment_corr,explode=explode,labels=data.Region.unique(),colors=colors,autopct='%1.1f%%')
plt.title('Region VS Government Corruption',color='blue',fontsize=15)
plt.show()

In [None]:
generosity=[]

for col in data.Region.unique():
    generosity.append(sum(data[data['Region']==col].Generosity))

plt.figure(figsize=(10,10))
# The sums follow data.Region.unique() order, so the labels use the same order.
sns.barplot(x=data.Region.unique(),y=generosity)
plt.xticks(rotation=45)
plt.xlabel('Region')
plt.ylabel('Generosity Rates')
plt.title('Region VS Generosity Rates')
plt.show()

In [None]:
# generosity was built in data.Region.unique() order, so look the labels up in that order.
regions_in_order=data.Region.unique()
min_i=int(np.argmin(generosity))
max_i=int(np.argmax(generosity))

print('Region with the highest summed generosity: '+regions_in_order[max_i])
print('Region with the lowest summed generosity: '+regions_in_order[min_i])
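
<p>As a sketch of an alternative, the regions with the largest and smallest totals can be read straight from a grouped Series with <code>idxmax</code> and <code>idxmin</code>. The data below is made up for illustration.</p>

```python
import pandas as pd

# Hypothetical data: North has the largest total, South the smallest.
toy = pd.DataFrame({
    "Region": ["North", "North", "South", "East"],
    "Generosity": [0.2, 0.3, 0.1, 0.4],
})

totals = toy.groupby("Region")["Generosity"].sum()

most_generous = totals.idxmax()   # label of the largest total
least_generous = totals.idxmin()  # label of the smallest total

print(most_generous)   # North
print(least_generous)  # South
```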

In [None]:
freedom=[]
for c in data.Region.unique():
    freedom.append(sum(data[data['Region']==c].Freedom))

plt.figure(figsize=(10,10))
# The sums follow data.Region.unique() order, so the labels use the same order.
sns.barplot(x=data.Region.unique(),y=freedom)
plt.xticks(rotation=90)
plt.xlabel('Region')
plt.ylabel('Freedom Rate')
plt.title('Region VS Freedom Rates')
plt.show()

In [None]:
f2,ax2=plt.subplots(figsize=(9,15))
# All three lists follow data.Region.unique() order, so the labels use the same order.
sns.barplot(x=freedom,y=data.Region.unique(),label='Freedom',color='green',alpha=0.5)
sns.barplot(x=generosity,y=data.Region.unique(),label='Generosity',color='red',alpha=0.7)
sns.barplot(x=turst_goverment_corr,y=data.Region.unique(),label='Trust in Government (Corruption)',color='blue',alpha=0.9)
ax2.legend(loc='lower right',frameon = True)
ax2.set(xlabel='Summed Score', ylabel='Region',title = "Freedom, Generosity and Trust in Government by Region")
plt.show()

In [None]:
economy_gpd_capital=[]
for c in data.Region.unique():
    economy_gpd_capital.append(sum(data[data['Region']==c].Economy_GPD_Capital))


# The sums are kept in data.Region.unique() order (not sorted) so each bar matches its region label.

plt.figure(figsize=(10,10))
sns.barplot(x=data.Region.unique(),y=economy_gpd_capital)
plt.xticks(rotation=90)
plt.xlabel('Region')
plt.ylabel('Economy GDP Region')
plt.title('Region VS Economy')
plt.show()    


In [None]:
sns.countplot(x=data.Region)
plt.title('Data Region Counts')
plt.xticks(rotation=90)
plt.show() 


In [None]:
data.groupby('Region')[['Healt_Life_Expectancy','Freedom','Trust_Goverment_Corruption','Generosity']].mean()
# Australia and New Zealand seems to be the region with the best averages across these indicators.

In [None]:
data.head()

In [None]:
for i,col in enumerate(data.columns[3:]):
    plt.subplot(3,3,i+1)
    ax = sns.distplot(data[col], rug=True, hist=False)
    plt.title(col)
    fig, ax = plt.gcf(), plt.gca()
    fig.set_size_inches(10, 10)
    plt.tight_layout()
plt.show()

In [None]:
f, ax = plt.subplots(figsize=(6.5, 6.5))
sns.despine(f, left=True, bottom=True)
sns.scatterplot(x="Generosity", y="Freedom",data=data)
plt.show()

In [None]:
regions=pd.DataFrame(data.Region.unique(),index=range(10),columns=['Region'])
economy=pd.DataFrame(economy_gpd_capital,index=range(10),columns=['Economy'])
freedoms=pd.DataFrame(freedom,index=range(10),columns=['Freedom'])
generositys=pd.DataFrame(generosity,index=range(10),columns=['Generosity'])
happiess_rankss=pd.DataFrame(happiess_ranks,index=range(10),columns=['Happy Ranks'])
happiess_scoress=pd.DataFrame(happiess_scores,index=range(10),columns=['Happy Scores'])

new_data=pd.concat([regions,economy,freedoms,generositys,happiess_rankss,happiess_scoress],axis=1)

sns.pairplot(new_data)
plt.show()

In [None]:
new_data.head()

In [None]:
sns.boxplot(x='Region',y='Dystopia_Residual',data=data,palette='PRGn')
plt.xticks(rotation=90)
plt.show()


In [None]:
print(new_data.Region)

In [None]:
data.head()

In [None]:
sns.heatmap(data.iloc[:,2:].corr())
plt.show()

In [None]:
data.isna().values.any()

In [None]:
data.isna().sum()

In [None]:
data.isnull().sum()

In [None]:
Happiness_Score=[]
for region in data.Region.unique():
    Happiness_Score.append(sum(data[data['Region']==region].Happiness_Score))
    
fig, ax = plt.subplots()
ax.scatter(x = data['Region'].unique(), y =Happiness_Score)
plt.ylabel('Summed Happiness Score', fontsize=13)
plt.xlabel('Region', fontsize=13)
plt.xticks(rotation=90)
plt.show()

In [None]:
data.Region.value_counts()
plt.figure(figsize=(5,5))
plt.scatter(x=data[data['Region']=="Sub-Saharan Africa"].Economy_GPD_Capital,y=data[data['Region']=="Sub-Saharan Africa"].Happiness_Score)
plt.show()

In [None]:
plt.subplots(figsize=(12, 8))
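# Filter on Happiness_Rank as in the original condition, then correlate the numeric columns only.
top_corr = data[data['Happiness_Rank']>6.5].select_dtypes(include='number').corr()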
sns.heatmap(top_corr, annot=True)
plt.show()

In [None]:
data.columns

In [None]:
for i,col in enumerate(data.columns[2:]):
    plt.subplot(5,2 ,i+1)
    sns.distplot(data[col])
    plt.title(col)
    fig, ax = plt.gcf(), plt.gca()
    fig.set_size_inches(10, 10)
    plt.tight_layout()
plt.show()

In [None]:
data.columns

In [None]:
plt.figure(figsize=(14,4))

plt.subplot(1,3,1)
sns.barplot(x = 'Happiness_Rank', y = 'Healt_Life_Expectancy', data = data[:100])
plt.xticks(rotation=90)

plt.subplot(1,3,2)
sns.barplot(x = 'Happiness_Rank', y = 'Healt_Life_Expectancy', data = data[:100])
plt.xticks(rotation=90)

plt.subplot(1,3,3)
sns.barplot(x = 'Happiness_Rank', y = 'Healt_Life_Expectancy', data = data[:100])
plt.xticks(rotation=90)

plt.tight_layout()
plt.show()

In [None]:
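# Country and Region are text columns here, so the correlation uses the numeric columns only.
sns.heatmap(data.select_dtypes(include='number').corr(), annot = True, cmap='inferno')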
plt.show()
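
<p>A minimal sketch of computing a correlation matrix on a frame that mixes text and numeric columns. The frame <code>toy</code> is hypothetical; <code>select_dtypes</code> keeps only the numeric columns before <code>corr</code> is called.</p>

```python
import pandas as pd

# Hypothetical frame: one text column and two perfectly correlated numeric columns.
toy = pd.DataFrame({
    "Region": ["N", "S", "N", "S"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
})

# Drop non-numeric columns first, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include="number").corr()

print(list(corr.columns))  # ['x', 'y']
print(corr.loc["x", "y"])
```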

In [None]:
plt.scatter(x=np.arange(1,159),y=data['Freedom'],color='r')
plt.scatter(x=np.arange(1,159),y=data['Economy_GPD_Capital'],color='b')
plt.title('Freedom vs Economy_GPD_Capital')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
sns.distplot(data['Freedom'], kde = False, color='m', bins = 30)
plt.ylabel('Frequency')
plt.title('Freedom Distribution')
plt.show()

In [None]:
plt.figure(figsize=(8,5))
sns.distplot(data['Economy_GPD_Capital'], kde = False, color='r', bins = 30)
plt.ylabel('Frequency')
plt.title('Economy GDP Distribution')
plt.show()

In [None]:
plt.scatter(x=data['Freedom'],y=data['Happiness_Score'],color='r')
plt.scatter(x=data['Freedom'],y=data['Economy_GPD_Capital'],color='b')
plt.show()

In [None]:
data.columns[2:]

In [None]:
plt.scatter(x=data['Trust_Goverment_Corruption'],y=data['Happiness_Score'],color='r')
plt.scatter(x=data['Trust_Goverment_Corruption'],y=data['Economy_GPD_Capital'],color='b')
plt.scatter(x=data['Trust_Goverment_Corruption'],y=data['Generosity'],color='y')
plt.xlabel('Trust_Goverment_Corruption')
plt.ylabel('Happiness_Score , Economy_GPD_Capital , Generosity')
plt.show()

<p id='5'><b><h3>Data Cleaning</h3></b></p>

In [None]:
# The Country column is not needed: each country appears exactly once, so it only identifies the row.
data=data.drop('Country',axis=1)

In [None]:
data.head()

In [None]:
# Region is a categorical column with several distinct values, so it is encoded as integer labels.
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
data['Region']=le.fit_transform(data['Region'])
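
<p>A small sketch of what <code>LabelEncoder</code> does, using three region names from the data set in a made-up order. The encoder sorts the distinct labels and assigns integers in that sorted order, and <code>inverse_transform</code> maps a code back to its name. The variable names are illustrative assumptions.</p>

```python
from sklearn.preprocessing import LabelEncoder

# A few region names in arbitrary order (illustrative input only).
sample_regions = ["Western Europe", "Sub-Saharan Africa", "Eastern Asia", "Western Europe"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sample_regions)

print(list(encoder.classes_))  # ['Eastern Asia', 'Sub-Saharan Africa', 'Western Europe']
print(codes.tolist())          # [2, 1, 0, 2]

# Recover the region name behind a numeric code.
print(encoder.inverse_transform([1])[0])  # Sub-Saharan Africa
```

<p>Keeping the fitted encoder around makes it possible to translate filters such as <code>data['Region']==8</code> back into region names.</p>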

In [None]:
data.head()

In [None]:
# Happiness_Rank is only the ordering of the countries by score, so it is dropped.
data=data.drop('Happiness_Rank',axis=1)

In [None]:
# Turn Healt_Life_Expectancy into the binary target: values above 0.75 become 1, all other values become 0.
data['Healt_Life_Expectancy']=[1 if healt_value>0.75 else 0 for healt_value in data.Healt_Life_Expectancy]
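
<p>The thresholding above can also be written as a vectorized comparison. A sketch on four made-up values (the Series <code>health</code> is hypothetical); note that a value of exactly 0.75 falls into class 0 in both forms.</p>

```python
import pandas as pd

# Hypothetical health scores, including one exactly at the threshold.
health = pd.Series([0.95, 0.75, 0.40, 0.80])

# List-comprehension form, as in the cell above.
target_loop = [1 if value > 0.75 else 0 for value in health]

# Vectorized form: compare, then cast the booleans to integers.
target_vec = (health > 0.75).astype(int)

print(target_loop)          # [1, 0, 0, 1]
print(target_vec.tolist())  # [1, 0, 0, 1]
```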

In [None]:
# value > 0.75 -> 1, value <= 0.75 -> 0
data.Healt_Life_Expectancy.value_counts()
sns.countplot(x=data.Healt_Life_Expectancy)
plt.show()

In [None]:
data_healt_life=data['Healt_Life_Expectancy']
data=data.drop('Healt_Life_Expectancy',axis=1)

In [None]:
# Scatter every remaining feature (except Region) against the row position.
plt.figure(figsize=(10, 10))
row_id=1
for col in data.columns:
    if col!='Region':
        plt.subplot(4,2,row_id)
        row_id=row_id+1
        plt.title(col)
        plt.scatter(x=np.arange(1,len(data)+1),y=data[col],color='b')
plt.tight_layout()
plt.show()

# Points that lie far from the main cloud in these plots are potential outliers.
# Some regions contain such values; looking at them helps to analyze the data better.

In [None]:
data.columns

In [None]:
len(data[(data['Region']==8)].sort_values(by='Economy_GPD_Capital',ascending=False).iloc[:,3].values)

In [None]:
sns.barplot(x=np.arange(1,41),y=data[(data['Region']==8)].sort_values(by='Economy_GPD_Capital',ascending=False).iloc[:,3].values)
plt.xticks(rotation=90)
plt.xlabel('Ranges')
plt.ylabel('Values')
plt.show()

In [None]:
# Plot Freedom against Economy_GPD_Capital
sns.relplot(x="Economy_GPD_Capital", y="Freedom",
            sizes=(40, 400), alpha=.5, palette="muted",
            height=6, data=data)
plt.show()

In [None]:
# Show each distribution with both violins and points
sns.violinplot(data=data,inner="points")
plt.xticks(rotation=90)
plt.show()

In [None]:
# x and y are passed by keyword; the axis limits are left to the data range.
g = sns.jointplot(x="Happiness_Score", y="Family", data=data, kind="reg",
                  color="m", height=7)

plt.show()

In [None]:
data.head()

In [None]:
data[data['Region']==9].corr()

In [None]:

fig, axs = plt.subplots(2, 2, figsize=(5, 5))
axs[0, 0].hist(data['Happiness_Score'])
axs[1, 0].scatter(data['Standart_Error'], data['Economy_GPD_Capital'])
axs[0, 1].plot(data['Trust_Goverment_Corruption'], data['Freedom'])
axs[1, 1].hist2d(data['Happiness_Score'], data['Economy_GPD_Capital'])

plt.show()

In [None]:
data.columns

In [None]:
data.head()

In [None]:
data_new=data.copy()

In [None]:
data_new.sample(5)

In [None]:
data_new.corr()

In [None]:
data_new["NewFeature"]=data_new.Trust_Goverment_Corruption*data_new.Freedom

In [None]:
sns.heatmap(data_new.corr(),annot=True,fmt='.1f')
plt.show()

In [None]:
data[data['Region']==0]

In [None]:
data[data['Region']==6]

In [None]:
data[data['Region']==6].groupby('Region')['Economy_GPD_Capital'].mean()

In [None]:
import statsmodels.api as sm
model=sm.OLS(data.iloc[:,-1],data.iloc[:,:-1]).fit()
model.summary()

In [None]:
# Split the features and the binary Healt_Life_Expectancy target into train and test sets.
# The split comes first so that StandardScaler is fitted on the training data only.
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(data,data_healt_life,test_size=0.2,random_state=0)

In [None]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)
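
<p>A minimal sketch of the split-then-scale order on synthetic data (the arrays and the names <code>X_demo</code>, <code>X_tr</code>, <code>X_te</code> are assumptions for illustration). The scaler learns its mean and standard deviation from the training part only and reuses them on the test part.</p>

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and a binary target (illustrative only).
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics come from the training set
X_te_scaled = scaler.transform(X_te)      # the same statistics are reused here

print(X_tr_scaled.shape, X_te_scaled.shape)  # (80, 3) (20, 3)
print(np.allclose(X_tr_scaled.mean(axis=0), 0.0))  # True
```

<p>The test columns will generally not have a mean of exactly zero after scaling, which is expected because their statistics were never seen by the scaler.</p>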

<p id='6'><b><h3>Machine Learning</h3></b></p>

<p>Following the initial evaluation, we apply several machine learning algorithms: logistic regression, support vector machine (SVM), k-nearest neighbors (kNN), GradientBoostingClassifier and RandomForestClassifier. The first is logistic regression. To fit each model, we separate the dependent variable from the independent variables in the data set. In addition, we build several combinations of features so that different experiments can be run. For each combination, a search over hyperparameter values is used to find the best-performing settings.</p>

In [None]:
def plot_roc_(false_positive_rate,true_positive_rate,roc_auc):
    plt.figure(figsize=(5,5))
    plt.title('Receiver Operating Characteristic')
    plt.plot(false_positive_rate,true_positive_rate, color='red',label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],linestyle='--')
    plt.axis('tight')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
    
def plot_feature_importances(gbm):
    n_features = X_train.shape[1]
    plt.barh(range(n_features), gbm.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), X_train.columns)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)
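
<p>A small numeric sketch of the values that <code>plot_roc_</code> receives, using four made-up labels and scores. It computes the ROC points with <code>roc_curve</code> and the area under them with <code>auc</code>, the same two functions used in the model cells.</p>

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical true labels and predicted probabilities for the positive class.
y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8])

fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_true_demo, y_score_demo)
roc_auc_demo = auc(fpr_demo, tpr_demo)

# Three of the four (positive, negative) pairs are ranked correctly, so the AUC is 0.75.
print(roc_auc_demo)  # 0.75
```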

In [None]:
reduced_data_train = pd.DataFrame(X_train, columns=['Dim1', 'Dim2','Dim3','Dim4','Dim5','Dim6','Dim7','Dim8','Dim9'])
reduced_data_test = pd.DataFrame(X_test, columns=['Dim1', 'Dim2','Dim3','Dim4','Dim5','Dim6','Dim7','Dim8','Dim9'])
X_train=reduced_data_train
X_test=reduced_data_test

In [None]:
combine_features_list=[
    ('Dim1','Dim2','Dim3'),
    ('Dim4','Dim5','Dim5','Dim6'),
    ('Dim7','Dim8','Dim1'),
    ('Dim4','Dim8','Dim5','Dim9')
]

<p id='7'><b><h3>Logistic Regression</h3></b></p>
<p>First, we need to choose hyperparameters that let the model use the data effectively. Hyperparameter tuning with a grid search is used for this: every candidate combination is scored by cross-validation, and the best-scoring combination is reported for each feature set.</p>
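
<p>A minimal sketch of such a grid search on synthetic data. The data set from <code>make_classification</code>, the names <code>X_demo</code>, <code>y_demo</code> and <code>search</code>, and the choice of 5 folds are assumptions for illustration. The <code>liblinear</code> solver is chosen here because it supports both the <code>l1</code> and <code>l2</code> penalties.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification problem (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=120, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# 2 penalties x 3 values of C = 6 candidate settings.
param_grid = {
    "penalty": ["l1", "l2"],
    "C": [0.1, 0.4, 0.5],
    "solver": ["liblinear"],
}

search = GridSearchCV(
    LogisticRegression(random_state=0), param_grid, scoring="accuracy", cv=5
)
search.fit(X_demo, y_demo)

print(search.best_params_)                # best combination found
print(round(search.best_score_, 3))       # its mean cross-validated accuracy
print(len(search.cv_results_["params"]))  # 6
```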

In [None]:
parameters=[
{
    'penalty':['l1','l2'],
    'C':[0.1,0.4,0.5],
    'solver':['liblinear'],  # liblinear supports both the l1 and l2 penalties
    'random_state':[0]
    },
]

for features in combine_features_list:
    print(features)
    print("*"*50)
    
    X_train_set=X_train.loc[:,features]
    X_test_set=X_test.loc[:,features]
    
    gslog=GridSearchCV(LogisticRegression(),parameters,scoring='accuracy')
    gslog.fit(X_train_set,y_train)
    print('Best parameters set:')
    print(gslog.best_params_)
    print()
    predictions=[
    (gslog.predict(X_train_set),y_train,'Train'),
    (gslog.predict(X_test_set),y_test,'Test'),
    ]
    
    for pred in predictions:
        print(pred[2] + ' Classification Report:')
        print("*"*50)
        print(classification_report(pred[1],pred[0]))
        print("*"*50)
        print(pred[2] + ' Confusion Matrix:')
        print(confusion_matrix(pred[1], pred[0]))
        print("*"*50)

    print("*"*50)    
    basari=cross_val_score(estimator=LogisticRegression(),X=X_train,y=y_train,cv=12)
    print(basari.mean())
    print(basari.std())
    print("*"*50) 
   

In [None]:
from sklearn.linear_model import LogisticRegression

# The l1 penalty needs a solver that supports it, such as liblinear.
lr=LogisticRegression(C=0.1,penalty='l1',solver='liblinear',random_state=0)
lr.fit(X_train,y_train)

y_pred=lr.predict(X_test)


y_proba=lr.predict_proba(X_test)

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test,y_proba[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)
plot_roc_(false_positive_rate,true_positive_rate,roc_auc)


from sklearn.metrics import r2_score,accuracy_score

#print('Error rate :',r2_score(y_test,y_pred))
print('Accuracy :',accuracy_score(y_test, y_pred))
print("Logistic TRAIN score with ",format(lr.score(X_train, y_train)))
print("Logistic TEST score with ",format(lr.score(X_test, y_test)))
print()

cm=confusion_matrix(y_test,y_pred)
print(cm)
sns.heatmap(cm,annot=True)
plt.show()
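
<p>A short sketch of how to read the confusion matrix printed above, on six made-up labels. In scikit-learn the rows are the true classes and the columns are the predicted classes, so for a binary problem <code>ravel()</code> yields the counts in the order TN, FP, FN, TP.</p>

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels.
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()

print(cm_demo.tolist())  # [[2, 1], [1, 2]]
print(tn, fp, fn, tp)    # 2 1 1 2

# Accuracy is the share of correct predictions: (TN + TP) / total.
accuracy_demo = accuracy_score(y_true_demo, y_pred_demo)
print(accuracy_demo == (tn + tp) / len(y_true_demo))  # True
```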

<p id='8'><b><h3>K-Nearest Neighbors</h3></b></p>

In [None]:
parameters=[
{
    'n_neighbors':np.arange(2,33),
    'n_jobs':[2,6]
    },
]
print("*"*50)
for features in combine_features_list:
    print("*"*50)
    
    X_train_set=X_train.loc[:,features]
    X_test_set=X_test.loc[:,features]
   
    gsknn=GridSearchCV(KNeighborsClassifier(),parameters,scoring='accuracy')
    gsknn.fit(X_train_set,y_train)
    print('Best parameters set:')
    print(gsknn.best_params_)
    print("*"*50)
    predictions = [
    (gsknn.predict(X_train_set), y_train, 'Train'),
    (gsknn.predict(X_test_set), y_test, 'Test1')
    ]
    for pred in predictions:
        print(pred[2] + ' Classification Report:')
        print("*"*50)
        print(classification_report(pred[1], pred[0]))
        print("*"*50)
        print(pred[2] + ' Confusion Matrix:')
        print(confusion_matrix(pred[1], pred[0]))
        print("*"*50)
        
    print("*"*50)    
    basari=cross_val_score(estimator=KNeighborsClassifier(),X=X_train,y=y_train,cv=12)
    print(basari.mean())
    print(basari.std())
    print("*"*50)

In [None]:
knn=KNeighborsClassifier(n_jobs=2, n_neighbors=22)
knn.fit(X_train,y_train)

y_pred=knn.predict(X_test)

y_proba=knn.predict_proba(X_test)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test,y_proba[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)
plot_roc_(false_positive_rate,true_positive_rate,roc_auc)

from sklearn.metrics import r2_score,accuracy_score

print('Accuracy :',accuracy_score(y_test, y_pred))
print("KNN TRAIN score with ",format(knn.score(X_train, y_train)))
print("KNN TEST score with ",format(knn.score(X_test, y_test)))
print()

cm=confusion_matrix(y_test,y_pred)
print(cm)
sns.heatmap(cm,annot=True)
plt.show()

In [None]:
n_neighbors = range(1, 17)
train_data_accuracy = []
test1_data_accuracy = []
for n_neigh in n_neighbors:
    knn = KNeighborsClassifier(n_neighbors=n_neigh,n_jobs=5)
    knn.fit(X_train, y_train)
    train_data_accuracy.append(knn.score(X_train, y_train))
    test1_data_accuracy.append(knn.score(X_test, y_test))
plt.plot(n_neighbors, train_data_accuracy, label="Train Data Set")
plt.plot(n_neighbors, test1_data_accuracy, label="Test1 Data Set")
plt.ylabel("Accuracy")
plt.xlabel("Neighbors")
plt.legend()
plt.show()

In [None]:
n_neighbors = range(1, 17)
k_scores=[]
for n_neigh in n_neighbors:
    knn = KNeighborsClassifier(n_neighbors=n_neigh,n_jobs=5)
    scores=cross_val_score(estimator=knn,X=X_train,y=y_train,cv=12)
    k_scores.append(scores.mean())
print(k_scores)

In [None]:
plt.plot(n_neighbors,k_scores)
plt.xlabel('Value of k for KNN')
plt.ylabel("Cross-Validated Accuracy")
plt.show()
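
<p>Instead of reading the best k off the plot, it can be selected programmatically. A sketch on synthetic data (the data set, the 5 folds and the names <code>k_values</code>, <code>mean_scores</code> and <code>best_k</code> are assumptions for illustration):</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=150, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

k_values = list(range(1, 17))

# Mean cross-validated accuracy for every candidate k.
mean_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5).mean()
    for k in k_values
]

# argmax returns the first position of the highest mean score.
best_k = k_values[int(np.argmax(mean_scores))]
print(best_k, round(max(mean_scores), 3))
```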

<p id='9'><b><h3>Naive Bayes</h3></b></p>

In [None]:
# The grid below tunes a support vector classifier (SVC) with a linear and an RBF kernel.
parameters = [
    {
        'kernel': ['linear'],
        'random_state': [2]
    },
    {
        'kernel': ['rbf'],
        'gamma':[0.9,0.06,0.3],
        'random_state': [0],
        'C':[1,2,3,4,5,6],
        'degree':[2],
        'probability':[True]
    },
]

for features in combine_features_list:
    print("*"*50)
    X_train_set=X_train.loc[:,features]
    X_test_set=X_test.loc[:,features]
  
    svc = GridSearchCV(SVC(), parameters,
    scoring='accuracy')
    svc.fit(X_train_set, y_train)
    print('Best parameters set:')
    print(svc.best_params_)
    print("*"*50)
    predictions = [
    (svc.predict(X_train_set), y_train, 'Train'),
    (svc.predict(X_test_set), y_test, 'Test1')
    ]
    for pred in predictions:
        print(pred[2] + ' Classification Report:')
        print("*"*50)
        print(classification_report(pred[1], pred[0]))
        print("*"*50)
        print(pred[2] + ' Confusion Matrix:')
        print(confusion_matrix(pred[1], pred[0]))
        print("*"*50)
        
    print("*"*50)    
    basari=cross_val_score(estimator=SVC(),X=X_train,y=y_train,cv=4)
    print(basari.mean())
    print(basari.std())
    print("*"*50)

In [None]:
svc=SVC(C=5,degree=2,gamma=0.06,kernel='rbf',probability=True,random_state=0)
svc.fit(X_train,y_train)

y_pred=svc.predict(X_test)

y_proba=svc.predict_proba(X_test)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test,y_proba[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)
plot_roc_(false_positive_rate,true_positive_rate,roc_auc)

from sklearn.metrics import r2_score,accuracy_score

print('Accuracy :',accuracy_score(y_test, y_pred))
print("SVC TRAIN score with ",format(svc.score(X_train, y_train)))
print("SVC TEST score with ",format(svc.score(X_test, y_test)))
print()

cm=confusion_matrix(y_test,y_pred)
print(cm)
sns.heatmap(cm,annot=True)
plt.show()

<p id='10'><b><h3>Gradient Boosting Machine</h3></b></p>

In [None]:
parameters = [
{
    'learning_rate': [0.01, 0.02, 0.002],
    'random_state': [0],
    'n_estimators': np.arange(3, 20)
    },
]
for features in combine_features_list:
    print("*"*50)
    X_train_set=X_train.loc[:,features]
    X_test1_set=X_test.loc[:,features]
   
    gbc = GridSearchCV(GradientBoostingClassifier(), parameters, scoring='accuracy')
    gbc.fit(X_train_set, y_train)
    print('Best parameters set:')
    print(gbc.best_params_)
    print("*"*50)
    predictions = [
    (gbc.predict(X_train_set), y_train, 'Train'),
    (gbc.predict(X_test1_set), y_test, 'Test1')
    ]
    for pred in predictions:
        print(pred[2] + ' Classification Report:')
        print("*"*50)
        print(classification_report(pred[1], pred[0]))
        print("*"*50)
        print(pred[2] + ' Confusion Matrix:')
        print(confusion_matrix(pred[1], pred[0]))
        print("*"*50)
        
    print("*"*50)    
    basari=cross_val_score(estimator=GradientBoostingClassifier(),X=X_train,y=y_train,cv=4)
    print(basari.mean())
    print(basari.std())
    print("*"*50)

In [None]:
gbc=GradientBoostingClassifier(learning_rate=0.02,n_estimators=18,random_state=0)
gbc.fit(X_train,y_train)

y_pred=gbc.predict(X_test)

y_proba=gbc.predict_proba(X_test)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test,y_proba[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)
plot_roc_(false_positive_rate,true_positive_rate,roc_auc)

from sklearn.metrics import accuracy_score

print('Accuracy score:',accuracy_score(y_test, y_pred))
print("GradientBoostingClassifier TRAIN score:",format(gbc.score(X_train, y_train)))
print("GradientBoostingClassifier TEST score:",format(gbc.score(X_test, y_test)))
print()

cm=confusion_matrix(y_test,y_pred)
print(cm)
sns.heatmap(cm,annot=True)
plt.show()

In [None]:
plot_feature_importances(gbc)
plt.show()

<p id='11'><b><h3>Random Forest</h3></b></p>


In [None]:
parameters = [
    {
        'max_depth': np.arange(1, 10),
        'min_samples_split': np.arange(2, 5),
        'random_state': [3],
        'n_estimators': np.arange(10, 20)
    },
]

for features in combine_features_list:
    print("*"*50)
    
    X_train_set=X_train.loc[:,features]
    X_test1_set=X_test.loc[:,features]
    
    tree=GridSearchCV(RandomForestClassifier(),parameters,scoring='accuracy')
    tree.fit(X_train_set, y_train)
    
    print('Best parameters set:')
    print(tree.best_params_)
    print("*"*50)
    predictions = [
        (tree.predict(X_train_set), y_train, 'Train'),
        (tree.predict(X_test1_set), y_test, 'Test1')
    ]
    
    for pred in predictions:
        
        print(pred[2] + ' Classification Report:')
        print("*"*50)
        print(classification_report(pred[1], pred[0]))
        print("*"*50)
        print(pred[2] + ' Confusion Matrix:')
        print(confusion_matrix(pred[1], pred[0]))
        print("*"*50)
    
    print("*"*50)    
    # basari ("success"): 4-fold cross-validated accuracy on the current feature subset
    basari=cross_val_score(estimator=RandomForestClassifier(),X=X_train_set,y=y_train,cv=4)
    print(basari.mean())
    print(basari.std())
    print("*"*50)

In [None]:
rfc=RandomForestClassifier(max_depth=7,min_samples_split=4,n_estimators=19,random_state=3)
rfc.fit(X_train,y_train)

y_pred=rfc.predict(X_test)

y_proba=rfc.predict_proba(X_test)
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test,y_proba[:,1])
roc_auc = auc(false_positive_rate, true_positive_rate)
plot_roc_(false_positive_rate,true_positive_rate,roc_auc)

from sklearn.metrics import accuracy_score
print('Accuracy score:',accuracy_score(y_test, y_pred))
print("RandomForestClassifier TRAIN score:",format(rfc.score(X_train, y_train)))
print("RandomForestClassifier TEST score:",format(rfc.score(X_test, y_test)))
print()

cm=confusion_matrix(y_test,y_pred)
print(cm)
sns.heatmap(cm,annot=True)
plt.show()

In [None]:
for i in range(1,11):
    rf = RandomForestClassifier(n_estimators=i, random_state = 3, max_depth=7)
    rf.fit(X_train, y_train)
    print("TEST set score w/ " +str(i)+" estimators: {:.5}".format(rf.score(X_test, y_test)))
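<p>The loop above compares forest sizes by their score on the test set, so the test set ends up guiding the choice of <code>n_estimators</code>. A minimal sketch of an alternative is shown here: the same comparison done with cross-validation through scikit-learn's <code>validation_curve</code>. The data is a synthetic stand-in created with <code>make_classification</code>, and the names <code>X_demo</code>, <code>y_demo</code>, <code>cv_scores</code> and <code>best_n</code> are assumptions made for illustration, not part of the notebook.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in for X_train / y_train (assumption: not the happiness data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=3
)

# Same range as the loop above: forests with 1 to 10 trees.
n_estimators_range = np.arange(1, 11)

# Each row of the result holds the 4 fold scores for one value of n_estimators.
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(max_depth=7, random_state=3),
    X_demo,
    y_demo,
    param_name="n_estimators",
    param_range=n_estimators_range,
    cv=4,
    scoring="accuracy",
)

mean_cv = cv_scores.mean(axis=1)
for n, score in zip(n_estimators_range, mean_cv):
    print("CV accuracy w/ {} estimators: {:.5f}".format(n, score))

# Forest size with the highest mean cross-validated accuracy.
best_n = int(n_estimators_range[mean_cv.argmax()])
print("Best n_estimators by CV:", best_n)
```

<p>To try this on the notebook's data, <code>X_demo</code> and <code>y_demo</code> would be replaced by <code>X_train</code> and <code>y_train</code>, leaving the test set for a single final evaluation.</p>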

In [None]:
# rf is the last model from the loop above (n_estimators=10) when the cells are run in order
plot_feature_importances(rf)
plt.show()

<p id='12'><b><h3>Decision Tree</h3></b></p>

In [None]:
parameters = [
    {
        'random_state': [42],
    },
]
for features in combine_features_list:
    print("*"*50)
    X_train_set=X_train.loc[:,features]
    X_test1_set=X_test.loc[:,features]
    
    dtr = GridSearchCV(DecisionTreeClassifier(), parameters, scoring='accuracy')
    
    dtr.fit(X_train_set, y_train)
    print('Best parameters set:')
    print(dtr.best_params_)
    print("*"*50)
    predictions = [
        (dtr.predict(X_train_set), y_train, 'Train'),
        (dtr.predict(X_test1_set), y_test, 'Test1')
    ]
    for pred in predictions:
        print(pred[2] + ' Classification Report:')
        print("*"*50)
        print(classification_report(pred[1], pred[0]))
        print("*"*50)
        print(pred[2] + ' Confusion Matrix:')
        print(confusion_matrix(pred[1], pred[0]))
        print("*"*50)
        
    print("*"*50)    
    # basari ("success"): 4-fold cross-validated accuracy on the current feature subset
    basari=cross_val_score(estimator=DecisionTreeClassifier(),X=X_train_set,y=y_train,cv=4)
    print(basari.mean())
    print(basari.std())
    print("*"*50)  
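
<p>The grid above contains only <code>random_state</code>, so <code>GridSearchCV</code> evaluates a single configuration and no tuning takes place. A minimal sketch of a search over tree depth and split size is shown here, reusing the ranges from the Random Forest grid. The data is a synthetic stand-in, and the names <code>X_demo</code>, <code>y_demo</code>, <code>dt_parameters</code> and <code>dt_grid</code> are assumptions made for illustration.</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (assumption: not the happiness data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=42
)

# 9 depths x 3 split sizes = 27 candidate trees.
dt_parameters = [
    {
        'max_depth': np.arange(1, 10),
        'min_samples_split': np.arange(2, 5),
        'random_state': [42],
    },
]

dt_grid = GridSearchCV(DecisionTreeClassifier(), dt_parameters, scoring='accuracy', cv=4)
dt_grid.fit(X_demo, y_demo)

print('Best parameters set:')
print(dt_grid.best_params_)
print('Mean CV accuracy of the best setting:', dt_grid.best_score_)
```

<p>Limiting <code>max_depth</code> is one common way to keep a single tree from memorizing the training set; whether it helps on this data would have to be checked against the train and test reports.</p>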

<p id='13'><b><h3>Kernelized SVM</h3></b></p>

In [None]:
parameters = [
    {
        'random_state': [42],
    },
]
for features in combine_features_list:
    print("*"*50)
    X_train_set=X_train.loc[:,features]
    X_test1_set=X_test.loc[:,features]
    
    # svc_grid (not dtr) so the search object is named after the SVC model it wraps
    svc_grid = GridSearchCV(SVC(), parameters, scoring='accuracy')
    
    svc_grid.fit(X_train_set, y_train)
    print('Best parameters set:')
    print(svc_grid.best_params_)
    print("*"*50)
    predictions = [
        (svc_grid.predict(X_train_set), y_train, 'Train'),
        (svc_grid.predict(X_test1_set), y_test, 'Test1')
    ]
    for pred in predictions:
        print(pred[2] + ' Classification Report:')
        print("*"*50)
        print(classification_report(pred[1], pred[0]))
        print("*"*50)
        print(pred[2] + ' Confusion Matrix:')
        print(confusion_matrix(pred[1], pred[0]))
        print("*"*50)
        
    print("*"*50)    
    # basari ("success"): 4-fold cross-validated accuracy on the current feature subset
    basari=cross_val_score(estimator=SVC(),X=X_train_set,y=y_train,cv=4)
    print(basari.mean())
    print(basari.std())
    print("*"*50)  
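
<p>The search above fixes only <code>random_state</code>, so the kernel and its settings stay at the <code>SVC</code> defaults. A minimal sketch of a search over the kernel, <code>C</code> and <code>gamma</code> is shown here, with a <code>StandardScaler</code> in front of the classifier in case the features have not already been standardized. The data is a synthetic stand-in, the candidate values are arbitrary starting points, and the names <code>X_demo</code>, <code>y_demo</code>, <code>svm_pipeline</code>, <code>svm_parameters</code> and <code>svm_grid</code> are assumptions made for illustration.</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train (assumption: not the happiness data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=42
)

# Scaling lives inside the pipeline, so it is refit on each training fold only.
svm_pipeline = make_pipeline(StandardScaler(), SVC(random_state=42))

# make_pipeline names the SVC step 'svc', hence the 'svc__' prefix.
# 2 kernels x 3 values of C x 2 values of gamma = 12 candidates
# (gamma is ignored by the linear kernel).
svm_parameters = {
    'svc__kernel': ['linear', 'rbf'],
    'svc__C': [0.1, 1, 10],
    'svc__gamma': ['scale', 0.1],
}

svm_grid = GridSearchCV(svm_pipeline, svm_parameters, scoring='accuracy', cv=4)
svm_grid.fit(X_demo, y_demo)

print('Best parameters set:')
print(svm_grid.best_params_)
print('Mean CV accuracy of the best setting:', svm_grid.best_score_)
```

<p>Keeping the scaler inside the pipeline means the held-out fold never influences the scaling statistics, which keeps the cross-validated accuracy an honest estimate.</p>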