In [None]:
from IPython.display import Image
import os
!ls ../input/

Image("../input/charts1/MachL.png")

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

<h1 style="color:purple;">Classification Algorithms</h1>

In machine learning, classification is a **supervised learning approach**, which can be thought of as a means of categorizing or "classifying" unknown items into a discrete set of "classes." Classification attempts to learn the relationship between a set of feature variables and a target variable of interest.<br>
The target attribute in classification is a categorical variable with discrete values. So, how do classification and classifiers work?
Given a set of training data points, along with the target labels, classification determines the class label for an unlabeled test case.<br>
Let’s explain this with an example.<br>
A good example of classification is **loan default prediction**. Suppose a bank is concerned about the potential for loans not to be repaid. If previous loan default data can be used to predict which customers are likely to have problems repaying loans, these "bad risk" customers can either have their loan application declined or be offered alternative products.<br><br>

The goal of a loan default predictor is to use existing loan default data, which is information about the customers (such as age, income, education, etc.), to build a classifier, pass a new customer or potential future defaulter to the model, and then label that data point as "Defaulter" or "Not Defaulter", or for example, 0 or 1. This is how a classifier predicts an unlabeled test case.<br>
**Please notice that this specific example was about a binary classifier with two values.** <br>
We can build classifier models for both binary classification and multi-class classification. For example, imagine that you collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of three medications. You can use this labeled dataset, with a classification algorithm, to build a classification model.
Then you can use it to find out which drug might be appropriate for a future patient with the same illness.<br>
As you can see, this is an example of multi-class classification.<br><br>

Classification has different business use cases as well, for example: <br>
predicting the category to which a customer belongs; churn detection, where we predict whether a customer switches to another provider or brand; or predicting whether or not a customer responds to a particular advertising campaign.
Data classification has several applications in a wide variety of industries. Essentially, many problems can be expressed as associations between feature and target variables, especially when labeled data is available. This provides a broad range of applicability for classification.<br><br>
For example, classification can be used for 
* email filtering
* speech recognition
* handwriting recognition
* biometric identification
* document classification, and much more.<br>
The sections below cover the following classification algorithms, along with related clustering, dimensionality-reduction, and model-selection techniques.<br><br>
<ul>
    <li><a href="#knn" style="color:violet;">1. K-Nearest Neighbors Algorithm</a></li>
    <li><a href="#svm" style="color:violet;" >2. Support Vector Machine Algorithm & Naive Bayes</a></li>
    <li><a href="#dec" style="color:violet;">3. Decision Tree Classification</a></li>
    <li><a href="#rand" style="color:violet;">4. Random Forest Classification</a></li>
    <li><a href="#eval" style="color:violet;">5. Evaluation Of Classification Models</a></li>
    <li><a href="#kmeans" style="color:violet;">6. K-Means Clustering </a></li>
    <li><a href="#hier" style="color:violet;">7. Hierarchical Clustering </a></li>
    <li><a href="#pca" style="color:violet;">8. Principal Component Analysis (PCA) </a></li>
    <li><a href="#kfold" style="color:violet;">9. Model Selection & K-Fold Cross Validation, Grid Search Cross Validation</a></li>
    
    
</ul>



<h1 style="color:purple;" id="knn"> 1. K-Nearest Neighbors Algorithm </h1> 
<p>
 Imagine that a telecommunications provider has segmented its customer base by service usage patterns, categorizing the customers into four groups. If demographic data can be used to predict group membership, the company can customize offers for individual prospective customers. This is a classification problem.<br>
That is, given the dataset, with predefined labels, we need to build a model to be used to predict the class of a new or unknown case.<br>
The example focuses on using demographic data, such as region, age, and marital status, to predict usage patterns.
The target field, called custcat, has four possible values that correspond to the four customer groups, as follows: Basic Service, E-Service, Plus Service, and Total Service.<br>
Our objective is to build a classifier, for example using rows 0 to 7, to predict the class of row 8.<br><br>
    

Now, let’s define k-nearest neighbors. The k-nearest-neighbors algorithm is a classification algorithm that takes a set of labelled points and uses them to learn how to label other points.<br>
This algorithm classifies cases based on their similarity to other cases. In k-nearest neighbors, data points that are near each other are said to be “neighbors.”<br>
K-nearest neighbors is based on this paradigm: “Similar cases with the same class labels are near each other.” Thus, the distance between two cases is a measure of their dissimilarity. There are different ways to calculate the similarity, or conversely, the distance or dissimilarity, of two data points.
For example, this can be done using **Euclidean distance**.<br>
Now, let’s see how the k-nearest neighbors algorithm actually works. In a classification problem, the k-nearest neighbors algorithm works as follows:<br>
<p style="color:purple;" >
1. Pick a value for K. <br>
2. Calculate the distance from the new case (the holdout) to each of the cases in the dataset. <br>
3. Search for the K observations in the training data that are ‘nearest’ to the measurements of the unknown data point. <br>
4. Predict the response of the unknown data point using the most popular response value from the K nearest neighbors. <br>
</p>
</p>
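The four steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not how scikit-learn implements it: the helper names (`euclidean_distance`, `knn_predict`) and the toy points are made up for this sketch.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))

def knn_predict(train_points, train_labels, new_point, k):
    # Step 2: distance from the new case to every training case.
    distances = [
        (euclidean_distance(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k training cases that are nearest.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: the most popular label among the neighbors is the prediction.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups.
train_points = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.5], [8.5, 9.5]]
train_labels = ["B", "B", "B", "M", "M", "M"]

# Step 1: pick a value for k.
print(knn_predict(train_points, train_labels, [2.0, 2.0], k=3))
print(knn_predict(train_points, train_labels, [8.0, 9.0], k=3))
```

With two classes, an odd k avoids tied votes, which is one practical reason to prefer odd values.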

In [None]:
#K nearest neighbour algorithm
#First we are going to observe our dataset,
data=pd.read_csv("../input/classification/data.csv")
#drop unnecessary columns from our dataset
data.drop(['id','Unnamed: 32'],axis=1,inplace=True)
data.head()


In [None]:
#As can be seen from the dataset, the diagnosis column has two types of label:
#malignant = M --> bad tumor type
#benign = B --> good tumor type

M=data[ data['diagnosis']=='M' ]
B=data[ data['diagnosis']=='B' ]

#scatter plot of radius mean-texture_means

plt.scatter( M.radius_mean,M.texture_mean,color="red",label="bad" )
plt.scatter( B.radius_mean,B.texture_mean,color='green',label='Good')

plt.xlabel('radius mean')
plt.ylabel('texture mean')
plt.legend()
plt.show()

In [None]:
#convert the labels to integers with a list comprehension

data.diagnosis=[ 1 if each=='M' else 0 for each in data.diagnosis ]

#dependent variable is diagnosis column
y=data.diagnosis.values
#the rest of the columns are the independent variables (features, x)
#that affect the dependent variable y (diagnosis)
x_data=data.drop(['diagnosis'],axis=1) #features

#normalization
#min-max scale each column to [0,1], because the columns have very
#different value ranges

x=(x_data-x_data.min())/(x_data.max()-x_data.min())

#train and test splitting
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=42)

#knn model,import the required library
from sklearn.neighbors import KNeighborsClassifier

knn=KNeighborsClassifier( n_neighbors=8 ) #n_neighbors=k
knn.fit( x_train,y_train ) #train our model
prediction=knn.predict(x_test) #test our model

print("{} nn score: {}".format(8,knn.score(x_test,y_test))) #accuracy

In [None]:
#find k value

score_list=[]

#let's try different values of n_neighbors and see how
#the results change; here we'll try
#the numbers 1-14 and append each score
#to the list, then plot the scores to compare them

for each in range(1,15):
    knn2=KNeighborsClassifier(n_neighbors=each)
    knn2.fit(x_train,y_train)
    current_score=knn2.score( x_test,y_test )
    score_list.append( current_score )
plt.plot( range(1,15),score_list )
plt.xlabel("k values")
plt.ylabel("accuracy")

#the best k is the one with the highest accuracy in the plot

In [None]:
#Here I can say that choosing k=4 gives us the best
#result, since it has the highest accuracy value in the graph

#our previous accuracy is nearly 0.96, which means 96%
#now I'm going to change k to 4 and we'll see a better
#accuracy result

knn3=KNeighborsClassifier(n_neighbors=4)
knn3.fit(x_train,y_train)
current_score=knn3.score( x_test,y_test )

current_score
#see it's better,so try and find the best!


<h1 style="color:purple;" id="svm"> 2. Support Vector Machine Algorithm (SVM)</h1> 


SVM is a supervised algorithm that classifies cases by finding a separator.
<p style="color:purple;" >
1. Mapping data to a **high-dimensional** feature space so that the data points can be categorized even when the data are not otherwise linearly separable.<br>
2. Finding a separator: a separator is then estimated for the data in that space.
</p>

The data points fall into two different categories. If the dataset is not linearly separable, the two categories can only be separated by a curve, not a line. When the data lives in a space of three or more dimensions, we use a plane or hyperplane as the separator.<br><br>

**Data Transformation**<br>

If our data points are not linearly separable, for example when they lie along a single line (one dimension, one feature), then we must transform the data. For example, we can map each x to [x, x^2]. The data now has two dimensions (two features), and since x^2 traces a curve, we can separate the data points more easily.<br>

Mapping data into a higher-dimensional space is called kernelling (kernel types include linear, polynomial, RBF, and sigmoid).<br><br>
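A small sketch of the [x, x^2] transformation described above. The toy points and the hand-picked threshold of 2.5 are assumptions made for illustration; a real SVM would learn the separator rather than have it chosen by hand.

```python
# 1-D points: the class-1 points sit between the class-0 points,
# so no single threshold on x can separate the two classes.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1, 1, 0, 0]

def feature_map(x):
    """Map a 1-D point to the 2-D point [x, x^2]."""
    return [x, x * x]

mapped = [feature_map(x) for x in xs]

# In the mapped space the horizontal line x^2 = 2.5 separates the classes
# (2.5 is chosen by hand for this toy data).
predicted = [1 if point[1] < 2.5 else 0 for point in mapped]
print(predicted == ys)
```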

How do we find the right, optimized separator after the transformation?<br>

SVMs are based on the idea of finding a hyperplane that best divides the dataset into two classes. You can think of the hyperplane as a line that linearly separates these two classes.<br>
One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes.
<br><br>

So the goal is to choose a hyperplane with as big a margin as possible. The examples closest to the hyperplane are the support vectors. The support vectors alone are enough to achieve our goal, and the other training examples can be ignored. We try to find the hyperplane that has the maximum distance to the support vectors.<br><br>
<p style="color:purple;" >
Support Vector 1 --> w<sup>T</sup>x + b = 1<br>
Hyperplane --> w<sup>T</sup>x + b = 0<br>
Support Vector 2 --> w<sup>T</sup>x + b = -1<br>
</p>
The output of the algorithm is the pair of parameters w and b that define the hyperplane. You can make classifications using this estimated hyperplane.<br>
<br>
At prediction time, if the hyperplane equation returns a value greater than 0 for a data point, the point belongs to the first class, which lies above the line; otherwise it belongs to the second class.<br><br>
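The prediction rule can be sketched as follows. The values of `w` and `b` are hypothetical stand-ins for the parameters a trained linear SVM would output; the margin formula 2/||w|| follows from the two support-vector lines sitting at +1 and -1.

```python
import math

# Hypothetical parameters of an already-trained linear SVM (not from this notebook).
w = [2.0, -1.0]
b = -1.0

def decision_value(x):
    """Evaluate w^T x + b for a single point x."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

def classify(x):
    """First class (+1) if the point lies on the positive side of the hyperplane."""
    return 1 if decision_value(x) > 0 else -1

# Distance between the two support-vector lines w^T x + b = +1 and w^T x + b = -1.
margin_width = 2.0 / math.sqrt(sum(w_i ** 2 for w_i in w))

points = [[2.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
labels = [classify(p) for p in points]
print(labels, margin_width)
```

A smaller ||w|| means a wider margin, which is why maximizing the margin is usually posed as minimizing the norm of w.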

**Advantages:**<br>

* Accurate in high-dimensional spaces
* Memory efficient, because the decision function uses only the support vectors<br>

**Disadvantages:**<br>

* Prone to over-fitting (if the number of features is greater than the number of samples)
* No direct probability estimates
* Best suited to small datasets, since training becomes computationally expensive on very large ones<br>

**When to use:**<br>

* Image recognition
* Text category assignment
* Spam detection
* Sentiment analysis
* Gene expression classification
* Regression, outlier detection, and clustering









In [None]:
#again train and test splitting
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)
#changing the train/test split sizes changes the accuracy value
from sklearn.svm import SVC
from warnings import simplefilter

# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

svm=SVC( random_state=1 )
svm.fit( x_train,y_train )

print('accuracy of svm algorithm: ',svm.score( x_test,y_test ))


In [None]:
from sklearn.naive_bayes import GaussianNB

nb=GaussianNB()
nb.fit(x_train,y_train)

print('accuracy of naive bayes algorithm: ',nb.score( x_test,y_test ))

In [None]:
from IPython.display import Image
import os
!ls ../input/

Image("../input/charts1/chart.png")

<h1 style="color:purple;" id="dec"> 3. Decision Tree Classification</h1> <br>

A decision tree is a flowchart-like tree structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values, splitting it recursively in a process called recursive partitioning. This flowchart-like structure helps you in decision making. Its visualization resembles a flowchart diagram, which closely mimics human-level thinking. That is why decision trees are easy to understand and interpret.<br><br>

Decision Tree is a white box type of ML algorithm. It shares internal decision-making logic, which is not available in the black box type of algorithms such as Neural Network. Its training time is faster compared to the neural network algorithm. The time complexity of decision trees is a function of the number of records and number of attributes in the given data. The decision tree is a distribution-free or non-parametric method, which does not depend upon probability distribution assumptions. Decision trees can handle high dimensional data with good accuracy.<br><br>

How does the Decision Tree algorithm work?<br><br>

The basic idea behind any decision tree algorithm is as follows:<br>

1. Select the best attribute using Attribute Selection Measures (ASM) to split the records.
2. Make that attribute a decision node and break the dataset into smaller subsets.
3. Build the tree by repeating this process recursively for each child until one of the following conditions is met:<br>

* All the tuples belong to the same class.
* There are no more remaining attributes.
* There are no more instances.
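As a rough illustration of the attribute-selection step, the sketch below uses Gini impurity as the selection measure (one common choice; information gain is another) and tries every feature and threshold to find the purest split. The function names and toy rows are invented for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means the node is pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def split_impurity(rows, labels, feature_index, threshold):
    """Weighted Gini impurity after splitting on rows[i][feature_index] <= threshold."""
    left = [y for row, y in zip(rows, labels) if row[feature_index] <= threshold]
    right = [y for row, y in zip(rows, labels) if row[feature_index] > threshold]
    total = len(labels)
    return (len(left) / total) * gini(left) + (len(right) / total) * gini(right)

def best_split(rows, labels):
    """Try every feature and every observed value as a threshold; keep the purest split."""
    best = None
    for feature_index in range(len(rows[0])):
        for threshold in sorted({row[feature_index] for row in rows}):
            score = split_impurity(rows, labels, feature_index, threshold)
            if best is None or score < best[0]:
                best = (score, feature_index, threshold)
    return best

# Toy data (made up): feature 0 separates the classes perfectly, feature 1 does not.
rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [8.0, 2.0], [9.0, 5.0], [10.0, 1.0]]
labels = [0, 0, 0, 1, 1, 1]
score, feature_index, threshold = best_split(rows, labels)
print(score, feature_index, threshold)
```

A full tree builder would call `best_split` recursively on the left and right subsets until one of the stopping conditions listed above is reached.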

In [None]:
# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

In [None]:

# load dataset
pima = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
pima['Pregnancies']=pima['Pregnancies'].astype('float')
pima.head()


In [None]:
#split dataset into features and target variable
#every column except the target 'Outcome' is used as a feature
feature_cols = [col for col in pima.columns if col != 'Outcome']

y = pima.Outcome # Target variable
X = pima[feature_cols] # Features


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

In [None]:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree classifier
clf = clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
from IPython.display import Image
import os
!ls ../input/

Image("../input/charts1/chart2.png")

<h1 style="color:purple;" id="rand"> 4. Random Forest Classification</h1> <br>

**Ensemble Methods**<br>
Ensemble methods are algorithms that combine multiple models into a single predictive model in order to decrease variance, decrease bias, or improve predictions.<br>

Ensemble methods are usually broken into two categories:<br>

Parallel: An ensemble method where the models that make up the building blocks of the larger method are generated independently of each other (i.e., they can be trained as trivially parallel problems applied to the dataset).<br>
Sequential: An ensemble method where the learners are generated in sequential order and depend on each other (i.e., they can only be trained one at a time, as the next model requires information from the training of the one before it).<br>
The random forest algorithm relies on a parallel ensemble method called "bagging" to generate its weak classifiers.<br>

**Bagging**<br>
Bagging is a colloquial term for bootstrap aggregation. Bootstrap aggregation is a method that allows us to decrease the variance of an estimate by averaging multiple estimates that are measured from random subsamples of a population.<br>

**Bootstrap Sampling**<br>
The first step of bagging is the application of bootstrap sampling to obtain subsets of the data. Each subset is then fed into one of the models that will make up the final ensemble. This is a straightforward process: given a set of observations, n observations are selected at random and with replacement to form the subsample. This subsample is then fed into the machine learning algorithm of choice to train the model.<br>

**Aggregation**<br>
After all of the models have been built, their outputs must be aggregated into a single coherent prediction for the larger model. In the case of a classifier model, this is usually just a winner-takes-all strategy: whichever category receives the most votes is the final predicted outcome. In the case of a regression problem, a simple average of the predicted outcome values is used.<br>
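Bootstrap sampling and the two aggregation rules can be sketched in plain Python. The helper names and the example votes are hypothetical; scikit-learn's `RandomForestClassifier` handles all of this internally.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) observations at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Classification: the class that received the most votes wins."""
    return Counter(predictions).most_common(1)[0][0]

def mean_prediction(predictions):
    """Regression: aggregate the model outputs with a simple average."""
    return sum(predictions) / len(predictions)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
observations = list(range(10))
sample = bootstrap_sample(observations, rng)
print(sample)  # same size as the original; some rows repeat, others are left out

# Hypothetical votes from five trees for one patient.
votes = ["benign", "malignant", "benign", "benign", "malignant"]
print(majority_vote(votes))
print(mean_prediction([1.0, 2.0, 3.0]))
```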

**Feature Bagging**<br>
Feature bagging (or the random subspace method) is a type of ensemble method that is applied to the features (columns) of a dataset instead of to the observations (rows). It is used to reduce the correlation between the estimators in an ensemble by training each base predictor on a random subset of features instead of the complete feature space.<br>

**The Random Forest** <br>
Based on what was previously covered in decision trees and ensemble methods, it should come as little surprise where the random forest gets its name or how it is constructed at a high level, but let’s go over it anyway.

A random forest is made up of a set of decision trees, each of which is trained on a random subset of the training data. These trees' predictions can then be aggregated to provide a single prediction.



In [None]:
from sklearn.tree     import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix


bc = load_breast_cancer()
X = bc.data
y = bc.target

# Create our test/train split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)


## build our models
decision_tree = DecisionTreeClassifier()
random_forest = RandomForestClassifier(n_estimators=100)

## Train the classifiers
decision_tree.fit(X_train, y_train)
random_forest.fit(X_train, y_train)

# Create Predictions
dt_pred = decision_tree.predict(X_test)
rf_pred = random_forest.predict(X_test)

# Check the performance of each model
print('Decision Tree Model')
print(classification_report(y_test, dt_pred, target_names=bc.target_names))

print('Random Forest Model')
print(classification_report(y_test, rf_pred, target_names=bc.target_names))

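# Compute and print the confusion matrices

dt_cm = confusion_matrix(y_test, dt_pred)
rf_cm = confusion_matrix(y_test, rf_pred)
print('Decision Tree confusion matrix:\n', dt_cm)
print('Random Forest confusion matrix:\n', rf_cm)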

As the reports show, we’re able to increase the number of correctly predicted benign tumors and decrease the number of benign tumors that are predicted as malignant. Exact counts vary from run to run because neither model is given a fixed random seed. By using a random forest, we can more accurately predict the state of a tumor, potentially decreasing the number of unneeded procedures performed on patients and decreasing patient stress about their diagnosis.<br>

At this point, you’d usually start investigating hyperparameter tuning. This is a crucial part of the modeling process in order to ensure that your model is optimal. 

<h1 style="color:purple;" id="eval"> 5. Evaluation Of Classification Models</h1> <br>

In [None]:
data=pd.read_csv('../input/classification/data.csv')

data.drop(['id','Unnamed: 32'],axis=1,inplace=True)

#diagnosis type cannot be object it must be categorical or integer 
#convert them into integer with list comprehension

data.diagnosis=[ 1 if each=='M' else 0 for each in data.diagnosis ]

y=data.diagnosis.values
x_data=data.drop(['diagnosis'],axis=1) #features

#normalization (min-max scaling of each column)

import numpy as np
x=(x_data-x_data.min())/(x_data.max()-x_data.min())

#train and test data splitting

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.15,random_state=42)

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf=RandomForestClassifier( n_estimators=100,random_state=1 )
rf.fit(x_train,y_train)

print("Random Forest Classification score: ",rf.score(x_test,y_test))
#n_estimators: how many trees the forest contains
#random_state: fixes the random subsampling so the result is the same on every run

y_pred=rf.predict(x_test)
y_true=y_test

In [None]:
from sklearn.metrics import confusion_matrix
cm=confusion_matrix(y_true,y_pred)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

f,ax=plt.subplots(figsize=(5,5))
sns.heatmap(cm,annot=True,linewidths=0.5,linecolor='red',fmt='.0f',ax=ax)
plt.xlabel('y_pred')
plt.ylabel('y_true')
plt.show()

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.<br>
<p  style="color:purple;" >
    <b>true positives (TP):</b> These are cases in which we predicted yes (they have the disease), and they do have the disease.<br>
    <b> true negatives (TN):</b> We predicted no, and they don't have the disease.<br>
    <b> false positives (FP):</b> We predicted yes, but they don't actually have the disease. (Also known as a "Type I error.")<br>
    <b>false negatives (FN):</b> We predicted no, but they actually do have the disease. (Also known as a "Type II error.")
</p>


The example figures below assume a test set of 165 cases with TP = 100, TN = 50, FP = 10, and FN = 5, which gives 105 actual "yes" cases, 60 actual "no" cases, and 110 predicted "yes" cases.<br>
**Accuracy:** <br>
Overall, how often is the classifier correct?<br>
(TP+TN)/total = (100+50)/165 = 0.91<br>
**Misclassification Rate:**<br>
Overall, how often is it wrong?<br>
(FP+FN)/total = (10+5)/165 = 0.09<br>
equivalent to 1 minus Accuracy, also known as "Error Rate"<br>
**True Positive Rate:**<br> 
When it's actually yes, how often does it predict yes?<br>
TP/actual yes = 100/105 = 0.95<br>
also known as "Sensitivity" or "Recall"<br>
**False Positive Rate:** <br>When it's actually no, how often does it predict yes?<br>
FP/actual no = 10/60 = 0.17<br>
**True Negative Rate:** <br>When it's actually no, how often does it predict no?<br>
TN/actual no = 50/60 = 0.83<br>
equivalent to 1 minus False Positive Rate<br>
also known as "Specificity"<br>
**Precision:** <br>When it predicts yes, how often is it correct?<br>
TP/predicted yes = 100/110 = 0.91<br>
**Prevalence:**<br> How often does the yes condition actually occur in our sample?<br>
actual yes/total = 105/165 = 0.64<br><br>
A couple of other terms are also worth mentioning:<br>
A couple other terms are also worth mentioning:<br>

**Null Error Rate:** <br>This is how often you would be wrong if you always predicted the majority class. (In our example, the null error rate would be 60/165=0.36 because if you always predicted yes, you would only be wrong for the 60 "no" cases.) This can be a useful baseline metric to compare your classifier against. However, the best classifier for a particular application will sometimes have a higher error rate than the null error rate, as demonstrated by the Accuracy Paradox.<br>
**Cohen's Kappa:** This is essentially a measure of how well the classifier performed as compared to how well it would have performed simply by chance. In other words, a model will have a high Kappa score if there is a big difference between the accuracy and the null error rate.<br>
**F Score:** This is a weighted harmonic mean of the true positive rate (recall) and precision; the common F1 score weights the two equally.<br>
**ROC Curve:** This is a commonly used graph that summarizes the performance of a classifier over all possible thresholds. It is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as you vary the threshold for assigning observations to a given class.<br>
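The rates defined above can be computed directly from the four confusion-matrix counts. The sketch below uses counts consistent with the worked figures in this section (TP = 100, TN = 50, FP = 10, FN = 5, which follow from 100/110, 50/60 and 105/165); the function name is made up for this example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive the common evaluation rates from confusion-matrix counts."""
    total = tp + tn + fp + fn
    actual_yes = tp + fn
    actual_no = tn + fp
    predicted_yes = tp + fp
    recall = tp / actual_yes
    precision = tp / predicted_yes
    return {
        "accuracy": (tp + tn) / total,
        "misclassification_rate": (fp + fn) / total,
        "true_positive_rate": recall,
        "false_positive_rate": fp / actual_no,
        "true_negative_rate": tn / actual_no,
        "precision": precision,
        "prevalence": actual_yes / total,
        # Error made by always predicting the majority class.
        "null_error_rate": min(actual_yes, actual_no) / total,
        # F1: harmonic mean of precision and recall.
        "f1_score": 2 * precision * recall / (precision + recall),
    }

example_metrics = classification_metrics(tp=100, tn=50, fp=10, fn=5)
for name, value in example_metrics.items():
    print(f"{name}: {value:.2f}")
```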

<h1 style="color:purple;" id="kmeans"> 6. K-means Clustering</h1> <br>

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

#create the dataset from Gaussian (normal) random variables
#class 1
x1=np.random.normal(25,5,1000) #mean=25, sigma=5, 1000 points
y1=np.random.normal(25,5,1000)

#class 2
x2=np.random.normal(55,5,1000) #mean=55, sigma=5, 1000 points
y2=np.random.normal(60,5,1000)

#class 3
x3=np.random.normal(55,5,1000) #mean=55, sigma=5, 1000 points
y3=np.random.normal(15,5,1000)

x=np.concatenate((x1,x2,x3),axis=0) #stack the three classes end to end: 3000 values
y=np.concatenate((y1,y2,y3),axis=0) #stack the three classes end to end: 3000 values
dictionary={"x":x,"y":y}

data=pd.DataFrame(dictionary)

plt.scatter(x1,y1,color='black') #every point is black because this is an unsupervised
plt.scatter(x2,y2,color='black') #learning setting; remove color='black' to see the three classes
plt.scatter(x3,y3,color='black')
plt.show()

In [None]:
data.head()#concatenated data

In [None]:
from sklearn.cluster import KMeans
wcss=[]
#to find the optimum value of k we try all k values in for loop
#according to the elbow rule we'll decide the k value
for k  in range(1,15):
    kmeans=KMeans(n_clusters=k)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)
    
plt.plot( range(1,15),wcss )
plt.xlabel('number of k(cluster value)')
plt.ylabel('wcss')
plt.show()
#most optimum k value is 3

**Choosing K** <br>
K-means finds the clusters and data set labels for a particular pre-chosen K. To find the number of clusters in the data, the user needs to run the K-means clustering algorithm for a range of K values and compare the results. In general, there is no method for determining the exact value of K, but a reasonable estimate can be obtained using the following techniques.<br>

One of the metrics that is commonly used to compare results across different values of K is the mean distance between data points and their cluster centroid. Since increasing the number of clusters will always reduce the distance to data points, increasing K will always decrease this metric, to the extreme of reaching zero when K is the same as the number of data points. Thus, this metric cannot be used as the sole target. Instead, mean distance to the centroid as a function of K is plotted and the "elbow point," where the rate of decrease sharply shifts, can be used to roughly determine K.<br>

A number of other techniques exist for validating K, including cross-validation, information criteria, the information theoretic jump method, the silhouette method, and the G-means algorithm. In addition, monitoring the distribution of data points across groups provides insight into how the algorithm is splitting the data for each K.
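To make the metric concrete, here is a minimal NumPy sketch of K-means (plain Lloyd iterations with random initial centroids) together with the mean distance to the centroid. It is a simplified stand-in for `sklearn.cluster.KMeans`, which uses a smarter initialization; the function names and toy blobs are assumptions for this example.

```python
import numpy as np

def assign_labels(points, centroids):
    """Index of the nearest centroid for every point."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Alternate between assigning points and moving centroids until stable."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign_labels(points, centroids)
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign_labels(points, centroids)

def mean_distance_to_centroid(points, centroids, labels):
    """The metric discussed above: average distance of a point to its centroid."""
    return float(np.linalg.norm(points - centroids[labels], axis=1).mean())

rng = np.random.default_rng(42)
toy_points = np.vstack([
    rng.normal(loc=(25, 25), scale=5, size=(30, 2)),
    rng.normal(loc=(55, 60), scale=5, size=(30, 2)),
    rng.normal(loc=(55, 15), scale=5, size=(30, 2)),
])
for k in (1, 3, len(toy_points)):
    centroids, labels = kmeans_sketch(toy_points, k)
    print(k, mean_distance_to_centroid(toy_points, centroids, labels))
```

When K equals the number of points, every point is its own centroid and the metric reaches zero, which is why the metric alone cannot pick K.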

In [None]:
#so let's choose k=3 and see the model

kmeans2=KMeans(n_clusters=3)
clusters=kmeans2.fit_predict(data)

print(clusters[:20])
#print(clusters[:50])
#the labels are 0, 1 and 2:
#k-means assigned one of them to each group of data

data["label"]=clusters #add the cluster labels to the data
#so that we can visualize the clusters
plt.scatter( data.x[data.label==0],data.y[data.label==0],color='red')
plt.scatter( data.x[data.label==1],data.y[data.label==1],color='blue')
plt.scatter( data.x[data.label==2],data.y[data.label==2],color='green')
#see, the data is successfully clustered: the groups are separated from each other
#now let's look at the centroids

#kmeans2.cluster_centers_ is a two dimensional array
plt.scatter(kmeans2.cluster_centers_[:,0],kmeans2.cluster_centers_[:,1],color='yellow')

<h1 style="color:purple;" id="hier"> 7. Hierarchical Clustering</h1> <br>

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

#create the dataset from Gaussian (normal) random variables
#class 1
x1=np.random.normal(25,5,100) #mean=25, sigma=5, 100 points
y1=np.random.normal(25,5,100)

#class 2
x2=np.random.normal(55,5,100) #mean=55, sigma=5, 100 points
y2=np.random.normal(60,5,100)

#class 3
x3=np.random.normal(55,5,100) #mean=55, sigma=5, 100 points
y3=np.random.normal(15,5,100)

x=np.concatenate((x1,x2,x3),axis=0) #stack the three classes end to end: 300 values
y=np.concatenate((y1,y2,y3),axis=0) #stack the three classes end to end: 300 values
dictionary={"x":x,"y":y}

data=pd.DataFrame(dictionary)

plt.scatter(x1,y1) #each class gets its own default color here;
plt.scatter(x2,y2) #the clustering itself never sees these class labels
plt.scatter(x3,y3)
plt.show()

In [None]:
#we'll draw a dendrogram

from scipy.cluster.hierarchy import dendrogram, linkage

merg=linkage( data,method='ward') #ward minimizes the variance (spread) within the clusters
dendrogram(merg,leaf_rotation=90)
plt.xlabel('data points')
plt.ylabel('euclidean distance')
plt.show()

<h1 style="color:purple;" id="pca"> 8. Principal Component Analysis (PCA)</h1> <br>

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze them much faster without extraneous variables to process.
So to sum up, the idea of PCA is simple: reduce the number of variables of a data set while preserving as much information as possible.
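One way to see what "preserving information" means is to compute PCA by hand: center the data, take the eigenvectors of the covariance matrix, and keep the directions with the largest eigenvalues. The sketch below does this with NumPy on made-up correlated data; it omits whitening and is not the exact procedure `sklearn.decomposition.PCA` uses internally (that relies on an SVD).

```python
import numpy as np

def pca_sketch(data, n_components):
    """Project data onto its top principal components.

    Returns (projected_data, explained_variance_ratio).
    """
    centered = data - data.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    # eigh is intended for symmetric matrices; eigenvalues come back in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]  # largest variance first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    components = eigenvectors[:, :n_components]
    projected = centered @ components
    explained_variance_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return projected, explained_variance_ratio

# Toy data (made up): three features that are almost copies of one hidden variable.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
toy = np.hstack([base, 2 * base, -base]) + 0.05 * rng.normal(size=(200, 3))

projected, ratio = pca_sketch(toy, n_components=1)
print(projected.shape, ratio)
```

Because the three toy features carry nearly the same signal, a single component keeps almost all of the variance.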

In [None]:
from sklearn.datasets import load_iris

iris=load_iris()
#convert it to a data frame
data=iris.data # numpy array
feature_names=iris.feature_names
y=iris.target

df=pd.DataFrame(data,columns=feature_names)
df['class']=y #we have three classes: 0, 1 and 2

x=data
df.head()


In [None]:
from sklearn.decomposition import PCA
#we are trying to reduce the number of features in our data
#reduce the features to 2; whiten=True normalizes the components
pca=PCA( n_components=2,whiten=True )
pca.fit(x)

#fit() built the dimension-reducing model (did the math);
#to apply it we still have to transform the data
x_pca=pca.transform(x)
print('variance ratio: ',pca.explained_variance_ratio_)
print('sum: ',sum(pca.explained_variance_ratio_))
#we still keep about 97% (the sum) of the variance in the data

In [None]:
#we'll make a 2D visualization with PCA
df['p1']=x_pca[:,0]
df['p2']=x_pca[:,1]
#p1 and p2 are the features we obtained from the reduction;
#we add them to the dataframe

color=["red","green","blue"]

import matplotlib.pyplot as plt

for each in range(3):
    plt.scatter(df[ df['class']==each ].p1,df[ df['class']==each ].p2,color=color[each],label=iris.target_names[each] )
plt.legend()
plt.show()
#versicolor and virginica overlap a little, but the classes are still well
#separated; the overlap reflects the information lost by reducing the features

<h1 style="color:purple;" id="kfold"> 9. Model Selection & K-Fold Cross Validation</h1> <br>
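K-fold cross-validation, which the cells in this section run through `cross_val_score` and `GridSearchCV`, splits the training data into k parts and uses each part once as the validation set while training on the rest. The sketch below only shows how the index split works; the function name is made up, and it skips the shuffling and stratification that scikit-learn's splitters can apply.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Averaging the k validation scores gives a steadier estimate than a single train/test split, and their standard deviation shows how much the score depends on the split.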

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
import pandas as pd
import numpy as np

iris=load_iris()
#features and target as numpy arrays
x=iris.data
y=iris.target
#normalization
x=( x-np.min(x))/(np.max(x)-np.min(x))
#train and test
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3)

#knn model
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=3)
#we look at the 3 nearest neighbors
#cv=10 gives us 10 accuracy values, one per fold
accuracies=cross_val_score( estimator=knn,X=x_train, y=y_train,cv=10 )
#the training data is split into 10 parts; each time one part is used for validation and the rest for training
print('Accuracy values are: ',accuracies)

print('average accuracy: ',np.mean(accuracies))
print('accuracy std: ',np.std(accuracies))

knn.fit(x_train,y_train)
print('test accuracy: ',knn.score(x_test,y_test))

In [None]:
#grid search cross validation
from sklearn.model_selection import GridSearchCV
#the grid holds the parameter we want to tune
grid={'n_neighbors':np.arange(1,50)}
knn=KNeighborsClassifier()
#previously we picked n_neighbors by hand;
#now GridSearchCV finds the optimum value for us
#by cross-validating knn for every value in the grid
#and keeping the best one in knn_cv
knn_cv=GridSearchCV( knn,grid,cv=10 )
knn_cv.fit(x,y)

print("tuned hyperparameter K:",knn_cv.best_params_)
print("the best accuracy score according to \nthe tuned parameter: ",knn_cv.best_score_)

In [None]:
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

from sklearn.linear_model import LogisticRegression


#grid search cv with logistic regression
#keep only the first 100 samples (two classes) for binary classification
x=x[:100,:]
y=y[:100]


#C is the regularization parameter. If it is too high the model
#overfits (it memorizes the data); if it is too low the model underfits
#and cannot learn the data well
#l1 and l2 are the regularization penalties (lasso and ridge)
grid={'C':np.logspace(-3,3,7),'penalty':['l1','l2']}
logreg=LogisticRegression(solver='liblinear') #liblinear supports both l1 and l2
logreg_cv=GridSearchCV(logreg,grid,cv=10)
logreg_cv.fit(x,y)
print('accuracy',logreg_cv.best_score_)

In [None]:
#this time fit the grid search on x_train,y_train only and keep a test set aside
x=x[:100,:]
y=y[:100]

#normalization
x=( x-np.min(x))/(np.max(x)-np.min(x))
#train and test
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3)

grid={'C':np.logspace(-3,3,7),'penalty':['l1','l2']}
logreg=LogisticRegression(solver='liblinear') #liblinear supports both l1 and l2
logreg_cv=GridSearchCV(logreg,grid,cv=10)
logreg_cv.fit(x_train,y_train)
print('tuned hyperparameters: ',logreg_cv.best_params_)
print('accuracy',logreg_cv.best_score_)

#build a new logistic regression model from the tuned values

logreg2=LogisticRegression(solver='liblinear',**logreg_cv.best_params_)
logreg2.fit(x_train,y_train)
print('score2: ',logreg2.score(x_test,y_test))

