## What will you learn?
* EDA
* Missing Value Analysis
* Categorical and Numeric Features
* Standardization
* Box - Swarm - Cat - Correlation Plot Analysis
* Outlier Detection
* ML Modelling and Tuning

## Content

1. [Python Libraries](#1)
1. [Data Content](#2)
1. [Read and Analyze Data](#3)
1. [Unique Value Analysis](#4)
1. [Categorical Feature Analysis](#5)
1. [Numeric Feature Analysis](#6)
1. [Standardization](#7)
1. [Box Plot Analysis](#8)
1. [Swarm Plot Analysis](#9)
1. [Cat Plot Analysis](#10)
1. [Correlation Analysis](#11)
1. [Outlier Detection](#12)
1. [Modeling](#13)
    1. Encoding Categorical Columns
    1. Scaling
    1. Train Test Split
    1. Logistic Regression
    1. Logistic Regression Hyperparameter Tuning
1. [Conclusion](#14)

<a id="1"></a>
## Python Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.metrics import accuracy_score,roc_curve

import warnings
warnings.filterwarnings("ignore")

 

<a id="2"></a>
## Data Content
* age: Patient's age
* sex: Patient's sex
* exng: Whether chest pain (angina) occurs with exercise (1 = yes, 0 = no)
* caa: Number of major vessels
* cp: Chest pain type
* trtbps: Resting blood pressure
* chol: Cholesterol
* fbs: Whether fasting blood sugar is above 120 mg/dl (1 = true, 0 = false)
* restecg: Resting ECG result
* thalachh: Maximum heart rate achieved
* output: Likelihood of a heart attack (1 = likely, 0 = unlikely)

<a id="3"></a>
## Read and Analyze Data

In [None]:
df=pd.read_csv("/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv")
df.head()

In [None]:
df.describe().T

In [None]:
df.info()

In [None]:
df.isnull().sum()

<a id="4"></a>
## Unique Value Analysis

In [None]:
df["sex"].value_counts()

In [None]:
df.columns

In [None]:
for i in list(df.columns):
    print("{} -- {}".format(i, df[i].value_counts().shape[0]))
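The loop above counts the distinct values in each column. As a minimal sketch on a hypothetical toy frame (`toy` and its values are illustrative, not rows from `heart.csv`), the same counts can also be read from `DataFrame.nunique()`:

```python
import pandas as pd

# Hypothetical toy frame standing in for the heart dataset
toy = pd.DataFrame({
    "sex": [1, 0, 1, 1],
    "cp": [3, 2, 1, 1],
    "age": [63, 37, 41, 56],
})

# Loop-based count, mirroring the cell above
loop_counts = {col: toy[col].value_counts().shape[0] for col in toy.columns}

# Equivalent one-liner: number of distinct non-null values per column
nunique_counts = toy.nunique()

print(nunique_counts)
```

Columns with only a handful of distinct values are the natural candidates for the categorical list used in the next section.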

<a id="5"></a>
## Categorical Feature Analysis

In [None]:
categorical_list=["sex","cp","restecg","fbs","exng","thall","caa","output"]

In [None]:
df.loc[:,categorical_list]

In [None]:
df_categoric=df.loc[:,categorical_list]
for i in categorical_list:
    plt.figure()
    sns.countplot(x=i,data=df_categoric,hue="output")
    plt.title(i)

<a id="6"></a>
## Numeric Feature Analysis

In [None]:
numeric_list=["age","trtbps","chol","thalachh","oldpeak","output"]

In [None]:
df_numeric=df.loc[:,numeric_list]
sns.pairplot(data=df_numeric,hue="output",diag_kind="kde")
plt.show()

<a id="7"></a>
## Standardization

In [None]:
scaler=StandardScaler()

In [None]:
scaled_array=scaler.fit_transform(df[numeric_list[:-1]])

In [None]:
scaled_array

In [None]:
pd.DataFrame(scaled_array).describe()
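To make explicit what `StandardScaler` computes, here is a small sketch that compares it with a manual z-score, $z = (x - \mu) / \sigma$. The array `x` holds made-up values, not data from `heart.csv`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for a single feature (one column)
x = np.array([[29.0], [41.0], [54.0], [63.0], [77.0]])

scaled = StandardScaler().fit_transform(x)

# Manual z-score with the population standard deviation (ddof=0),
# which is the convention StandardScaler follows
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

Note that `DataFrame.describe()` reports the sample standard deviation (ddof=1), so the `std` row in the summary above comes out slightly larger than 1 even though the scaling is correct.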

<a id="8"></a>
## Box Plot Analysis

In [None]:
# Standardize the features before box plot analysis so they share a common scale
df_dummy=pd.DataFrame(scaled_array,columns=numeric_list[:-1])
df_dummy.head()

In [None]:
df_dummy=pd.concat([df_dummy, df.loc[:,"output"]],axis=1)
df_dummy.head()

In [None]:
data_melted=pd.melt(df_dummy, id_vars="output", var_name="features", value_name="value")
data_melted.head()
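The `pd.melt` call reshapes the frame from wide to long so that seaborn can put every feature on one axis. A minimal sketch with a hypothetical three-row frame (`wide` and its numbers are made up) shows the shape change:

```python
import pandas as pd

# Hypothetical wide frame: one column per scaled feature plus the target
wide = pd.DataFrame({
    "age": [0.5, -1.2, 0.3],
    "chol": [1.1, 0.0, -0.7],
    "output": [1, 0, 1],
})

# One row per (sample, feature) pair; "output" is repeated for each feature
long = pd.melt(wide, id_vars="output", var_name="features", value_name="value")

print(long)
```

Three rows with two feature columns become six rows, each tagged with the feature name in `features` and its value in `value`.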

In [None]:
plt.figure()
sns.boxplot(x="features",y="value",hue="output",data=data_melted)
plt.show()

<a id="9"></a>
## Swarm Plot Analysis

In [None]:
plt.figure()
sns.swarmplot(x="features",y="value",hue="output",data=data_melted)
plt.show()

<a id="10"></a>
## Cat Plot Analysis


In [None]:
# Exercise: repeat this analysis for the other features

In [None]:
sns.catplot(x="exng",y="age",hue="output",col="sex",kind="swarm",data=df)
plt.show()

In [None]:
sns.catplot(x="fbs",y="age",hue="output",col="sex",kind="swarm",data=df)
plt.show()

<a id="11"></a>
## Correlation Analysis

In [None]:
df.corr()

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df.corr(),annot=True, fmt=".1f",linewidths=.7)
plt.show()
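The heatmap shows every pairwise correlation at once. To read off which features track the target most strongly, one option is to sort a single column of the correlation matrix by absolute value. The sketch below uses a hypothetical frame (`toy`, with made-up columns `up`, `down`, and `noise`) rather than the heart data:

```python
import pandas as pd

# Hypothetical data: "up" rises with the target, "down" falls, "noise" barely moves with it
toy = pd.DataFrame({
    "up": [1, 2, 3, 4, 5],
    "down": [5, 4, 3, 2, 1],
    "noise": [2, 1, 2, 1, 2],
    "output": [0, 0, 1, 1, 1],
})

# Pearson correlation of each feature with the target, strongest first
ranked = (
    toy.corr()["output"]
    .drop("output")
    .sort_values(key=abs, ascending=False)
)

print(ranked)
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom.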

<a id="12"></a>
## Outlier Detection

In [None]:
# Outliers can distort the machine learning process

In [None]:
numeric_list=["age","trtbps","chol","thalachh","oldpeak"]
df_numeric=df.loc[:,numeric_list]
df_numeric.head()

In [None]:
df.describe().T

In [None]:
# Outlier detection with the IQR rule (fences at 2.5 * IQR)
for i in numeric_list:
    Q1=np.percentile(df.loc[:, i],25)
    Q3=np.percentile(df.loc[:, i],75)
    IQR=Q3-Q1
    print("Old shape:",df.shape)

    # Upper fence: index labels of rows at or above Q3 + 2.5 * IQR
    upper=df.index[df.loc[:, i]>=(Q3 + 2.5*IQR)]

    # Lower fence: measured from Q1, not Q3
    lower=df.index[df.loc[:, i]<=(Q1 - 2.5*IQR)]

    print("{} -- {} -- {}".format(i,list(upper),list(lower)))

    # Drop by index label; positional indices from np.where stop matching
    # the labels once earlier iterations have removed rows
    df.drop(upper.union(lower),inplace=True)

    print("New shape:",df.shape)
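The same IQR rule can be written as a reusable boolean mask, which avoids dropping rows inside a loop. This is a sketch under assumptions: `iqr_outlier_mask` is a hypothetical helper, and `toy` holds made-up values with a non-default index to show that filtering works on labels.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(series, k=2.5):
    """Return True for values at or beyond k * IQR from the quartiles."""
    q1, q3 = np.percentile(series, [25, 75])
    iqr = q3 - q1
    return (series <= q1 - k * iqr) | (series >= q3 + k * iqr)

# Hypothetical data: one extreme cholesterol value at index label 16
toy = pd.DataFrame(
    {"chol": [200, 210, 220, 230, 240, 250, 600],
     "age": [40, 45, 50, 55, 60, 65, 70]},
    index=[10, 11, 12, 13, 14, 15, 16],
)

# A row is an outlier if any of its numeric columns is flagged
mask = toy.apply(iqr_outlier_mask).any(axis=1)
cleaned = toy[~mask]

print(cleaned.shape)
```

Unlike the loop, this computes every fence on the original data before removing anything, so the result does not depend on the order of the columns.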
        

<a id="13"></a>
## Modeling

In [None]:
df1=df.copy()

### Encoding Categorical Columns

In [None]:
df1.columns

In [None]:
categorical_list=['sex', 'cp', 'restecg', 'fbs', 'exng', 'thall', 'caa', 'slp',"output"]

In [None]:
df1=pd.get_dummies(df1,columns=categorical_list[:-1],drop_first=True)
df1.head()
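As a small illustration of what `drop_first=True` does, the sketch below encodes a hypothetical three-level column (`toy` and its values are made up). The first level gets no column of its own and is represented by all remaining dummies being false.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
toy = pd.DataFrame({"cp": [0, 1, 2, 1], "age": [63, 37, 41, 56]})

# Three levels (0, 1, 2) become two dummy columns; level 0 is the baseline
encoded = pd.get_dummies(toy, columns=["cp"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one level per feature avoids perfectly collinear dummy columns, which matters for linear models such as logistic regression.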

In [None]:
df1.columns

In [None]:
X = df1.drop(["output"],axis=1)
y=df1["output"]  # 1-D Series, the target shape scikit-learn estimators expect

In [None]:
X.head()

In [None]:
y

In [None]:
numeric_list

### Scaling

In [None]:
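scaler=StandardScaler()
# numeric_list was redefined without "output" in the outlier section,
# so slicing with [:-1] would leave "oldpeak" unscaled; scale the whole list
X[numeric_list]=scaler.fit_transform(X[numeric_list])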

In [None]:
X.head()

### Train Test Split

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.1,random_state=3)
print("X_train:{}".format(X_train.shape))
print("X_test:{}".format(X_test.shape))
print("y_train:{}".format(y_train.shape))
print("y_test:{}".format(y_test.shape))
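The notebook fits the scaler on all rows and splits afterwards. A common alternative is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence preprocessing. The sketch below illustrates that order on synthetic data; `X_demo`, `y_demo`, and the other names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 100 samples, 2 numeric features, binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(100, 2))
y_demo = rng.integers(0, 2, size=100)

# Split first, using the same test_size and random_state as the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=3
)

# Fit the scaler on the training rows only, then apply it to both parts
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

print(X_tr_s.shape, X_te_s.shape)
```

The training features end up with mean 0 and unit variance by construction, while the test features are only approximately standardized, which is the expected behavior.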

### Logistic Regression

In [None]:
logreg=LogisticRegression()
logreg

In [None]:
logreg.get_params()

In [None]:
logreg.fit(X_train,y_train)

In [None]:
# Calculate class probabilities
y_pred_prob=logreg.predict_proba(X_test)

In [None]:
y_pred_prob

In [None]:
y_pred=np.argmax(y_pred_prob,axis=1)
y_pred

In [None]:
dummy_=pd.DataFrame(y_pred_prob)
dummy_["y_pred"]=y_pred
dummy_.head()

In [None]:
print("Test Accuracy: {}".format(accuracy_score(y_test,y_pred)))
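Taking the `argmax` of `predict_proba` is equivalent to calling `predict` for a binary model, which corresponds to a 0.5 decision threshold. The sketch below checks this on synthetic data from `make_classification` (all names are hypothetical) and shows how a different threshold could be applied to the positive-class column:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem, not the heart dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression().fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)  # columns follow clf.classes_

# argmax over the probability columns reproduces predict()
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
same_as_predict = np.array_equal(labels_from_proba, clf.predict(X_demo))

# A lower threshold on the positive class flags at least as many positives
custom_labels = (proba[:, 1] >= 0.3).astype(int)

print(same_as_predict, labels_from_proba.sum(), custom_labels.sum())
```

Lowering the threshold trades more false positives for fewer missed positives, which may be worth considering in a medical screening setting.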

### Logistic Regression Hyperparameter Tuning

In [None]:
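lr=LogisticRegression(solver="liblinear")  # liblinear supports both l1 and l2; the default lbfgs solver rejects l1
lr

In [None]:
penalty=["l1","l2"]
parameters={"penalty":penalty}
lr_searcher=GridSearchCV(lr,parameters)
lr_searcher.fit(X_train,y_train)

In [None]:
print("Best parameters:",lr_searcher.best_params_)

In [None]:
# Refit with the parameters the search selected instead of hard-coding them
lr_tuned=LogisticRegression(solver="liblinear",**lr_searcher.best_params_)
lr_tuned.fit(X_train,y_train)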

In [None]:
y_pred=lr_tuned.predict(X_test)

In [None]:
accuracy_score(y_test,y_pred)
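The penalty type is only one knob; the regularization strength `C` usually matters at least as much. The sketch below extends the grid with a few `C` values on synthetic data. The grid values are arbitrary assumptions, and `solver="liblinear"` is chosen because it accepts both `l1` and `l2`, which the default `lbfgs` solver does not.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification problem, not the heart dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hypothetical grid: both penalties crossed with four regularization strengths
param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
search.fit(X_demo, y_demo)

# best_estimator_ is already refit on all of X_demo with the best parameters
best_model = search.best_estimator_

print(search.best_params_, round(search.best_score_, 3))
```

Using `best_estimator_` directly avoids copying the winning parameters by hand into a new model.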

<a id="14"></a>
## Conclusion

In [None]:
# Try other algorithms


## Random Forest 

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf=RandomForestClassifier()
rf

In [None]:
rf.fit(X_train,y_train)

In [None]:
y_pred=rf.predict(X_test)

In [None]:
accuracy_score(y_test,y_pred)

In [None]:
?rf

In [None]:
## Model Tuning
rf=RandomForestClassifier()
rf

In [None]:
rf_params={"max_depth":[2,3,5,7,8,10],
          "max_features":[2,5,8],
          "n_estimators":[10,500,1000],
          "min_samples_split":[2,5,10]}
rf_grid=GridSearchCV(rf,rf_params,cv=10,n_jobs=-1,verbose=2)
rf_grid.fit(X_train,y_train)

In [None]:
print("Best parameters:",rf_grid.best_params_)

In [None]:
rf_tuned=RandomForestClassifier(max_depth=7,max_features=2,min_samples_split=10,n_estimators=10)

In [None]:
rf_tuned

In [None]:
rf_tuned.fit(X_train,y_train)

In [None]:
y_pred=rf_tuned.predict(X_test)

In [None]:
accuracy_score(y_test,y_pred)

## Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

In [None]:
nb=GaussianNB()

In [None]:
nb.fit(X_train,y_train)

In [None]:
y_pred=nb.predict(X_test)

In [None]:
accuracy_score(y_test,y_pred)

In [None]:
?nb

##### On this dataset, the best-performing algorithm was Logistic Regression.
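A 10% hold-out on a dataset of a few hundred rows leaves only a few dozen test samples, so a ranking based on one split can change with the random seed. As a hedged sketch, cross-validation gives a steadier comparison. The data here is synthetic and the model settings are assumptions, so the scores say nothing about the heart dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification problem, not the heart dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The three model families used in this notebook, with illustrative settings
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "naive_bayes": GaussianNB(),
}

# Mean accuracy over 5 folds for each model
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print("{}: {:.3f}".format(name, score))
```

Running the same comparison on `X` and `y` from the notebook would show whether the advantage of Logistic Regression holds across folds.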