# 1. Proposal

We aim to solve the business problem of predicting which customer segment a wine from Italy falls into, based on the available wine dataset.
The wines in the dataset are grown in the same region of Italy but derive from three different cultivars. A chemical analysis determined the quantities of 13 constituents found in each of the three types of wine.

As part of this project, we intend to perform exploratory analysis on the given historical data, gain insights about it, and perform data pre-processing/data wrangling. This covers imputation of missing values, where needed, and computing the correlation matrix for the 13 constituents.
The correlation matrix tells us whether any of the 13 constituents are strongly correlated and could be grouped to reduce the number of columns in the data frame used for prediction.

We then apply feature engineering techniques such as scaling or binning, guided by the pre-processing findings, and use the results to determine the relevant features for our problem.
Once we have our features and findings, we can perform classification using standard ML algorithms. As in most ML workflows, we divide the wine dataset into two parts: 70% training data and 30% test data. If we find any discrepancies, we may adjust this split.

We then compare the accuracies of the models' predictions to decide which ML model is better suited to our segmentation problem.

#### Importing Required Libraries and Dataset

In [None]:
import pandas as pd #data pre-processing
import numpy as np  #linear algebra

import matplotlib.pyplot as plt #plotting
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')
from sklearn.metrics import confusion_matrix #algorithm evaluation
from sklearn import metrics

#read csv file - wine.csv and create a dataframe 

df = pd.read_csv('/kaggle/input/wine-customer-segmentation/Wine.csv')
df.head(10)

# 2. Exploratory Analysis and Data Pre-Processing

In [None]:
df.info()

#### Looking at the percentage of missing values per column

In [None]:
df.isnull()

In [None]:
# isnull().mean() gives the fraction of missing rows per column, so the row count is not hard-coded
missing_data = pd.DataFrame({'total_missing': df.isnull().sum(), 'perc_missing': df.isnull().mean()*100})
missing_data

#### Statistical description of numerical variables

In [None]:
df.describe()

##### Let's visualize the numerical quantities in our dataset as boxplots, to have a better sense of the outliers.

In [None]:
num_cols =['Alcohol','Malic_Acid','Ash','Ash_Alcanity','Magnesium','Total_Phenols','Flavanoids','Nonflavanoid_Phenols','Proanthocyanins','Color_Intensity','Hue','OD280','Proline','Customer_Segment']
plt.figure(figsize=(30,12))
df[num_cols].boxplot()
plt.title("Numerical variables in given Wine dataset", fontsize=20)
plt.show()
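
The boxplots show outliers visually. As a rough numeric companion, the sketch below counts, per column, the values that fall outside the usual 1.5 × IQR whiskers. It assumes scikit-learn's bundled `load_wine` data is an acceptable stand-in for `Wine.csv` (its column names are lowercase and differ slightly), and `iqr_outlier_counts` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import load_wine


def iqr_outlier_counts(frame: pd.DataFrame) -> pd.Series:
    """Count, per numeric column, the values outside the 1.5 * IQR whiskers."""
    numeric = frame.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Compare every column against its own lower and upper whisker limit.
    below = numeric.lt(q1 - 1.5 * iqr, axis="columns")
    above = numeric.gt(q3 + 1.5 * iqr, axis="columns")
    return (below | above).sum().sort_values(ascending=False)


# Assumption: the bundled UCI wine data stands in for Wine.csv.
wine = load_wine(as_frame=True).frame.drop(columns="target")
outlier_counts = iqr_outlier_counts(wine)
print(outlier_counts)
```

Columns with a high count are the ones whose boxes show many points beyond the whiskers; whether to treat those points is a separate modelling decision.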

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df.describe()[1:].transpose(),
            annot=True,linecolor="w",
            linewidth=2,cmap=sns.color_palette("Set1"))
plt.title("Wine Data summary")
plt.show()

# Correlation Matrix

In [None]:
cor_mat = df.corr()
# Hide the strict upper triangle so each pair of variables is shown only once
mask = np.triu(np.ones_like(cor_mat, dtype=bool), k=1)
fig=plt.gcf()
fig.set_size_inches(30,12)
sns.heatmap(data=cor_mat,mask=mask,square=True,annot=True,cbar=True)
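
Reading strongly correlated pairs off a large heatmap is error-prone, so a tabular view can help when deciding which constituents might be grouped. The following is a minimal sketch under stated assumptions: `correlated_pairs` is a hypothetical helper, the 0.7 threshold is arbitrary, and scikit-learn's `load_wine` stands in for `Wine.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine


def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle: each pair once, no diagonal.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # Cells blanked by where() are NaN and never pass the threshold test.
    return pairs[pairs >= threshold].sort_values(ascending=False)


# Assumption: the bundled UCI wine data stands in for Wine.csv.
features = load_wine(as_frame=True).data
strong_pairs = correlated_pairs(features, threshold=0.7)
print(strong_pairs)
```

Each row of the result is indexed by a pair of column names, which makes it easy to pick candidates for grouping or removal.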

##### Correlation of each variable with the Customer_Segment column

In [None]:
corr=df.corr()
# correlations with the target, strongest positive first
corr['Customer_Segment'].sort_values(ascending=False)

In [None]:
print('Group 1:',len(df[df.Customer_Segment == 1]))
print('Group 2:',len(df[df.Customer_Segment == 2]))
print('Group 3:',len(df[df.Customer_Segment == 3]))

In [None]:
plt.rcParams['figure.figsize'] = (20, 10)
size = df['Customer_Segment'].value_counts().sort_index()  # counts for segments 1, 2, 3
colors = ['mediumseagreen', 'c', 'gold']
labels = "Group 1", "Group 2", "Group 3"
explode = [0, 0, 0.1]
plt.subplot(1, 2, 1)
plt.pie(size, colors = colors, labels = labels, explode = explode, shadow = True, autopct = '%.2f%%')
#plt.title('Different Visitors', fontsize = 20)
plt.axis('off')
plt.legend()

In [None]:
plt.figure(figsize=(10,9))
sns.scatterplot(x='Ash_Alcanity',y='Color_Intensity',data=df,palette='Set1', hue = 'Customer_Segment');

#### Check Target variable imbalance

In [None]:
df['Customer_Segment'].value_counts(sort = False, normalize = True)*100

The customer segment looks reasonably balanced across its three classes, so no imbalance treatment is required.

#### Separating features and target before the train/test split

In [None]:
x = df.drop('Customer_Segment',axis=1)
y = df['Customer_Segment'].values

In [None]:
x.head()

In [None]:
y

## Feature Engineering - Scaling

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = sc.fit_transform(x)  #standardize the independent features

In [None]:
x

In [None]:
from sklearn.metrics import classification_report,precision_score,recall_score,f1_score,roc_auc_score,accuracy_score
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.3,random_state=1)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
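
The cells above standardize the full dataset before splitting, which lets test-set statistics influence the scaler. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below illustrates that ordering; it is not the notebook's method, it assumes scikit-learn's `load_wine` as a stand-in for `Wine.csv`, and `stratify=y` and `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled UCI wine data stands in for Wine.csv.
X, y = load_wine(return_X_y=True)

# Split first; stratify keeps the class ratios similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# The pipeline fits the scaler on the training rows only, then reuses the
# training means and standard deviations when it transforms the test rows.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

With this arrangement the same fitted pipeline can be applied to new wines without rescaling by hand.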

# MODEL BUILDING

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
lr=LogisticRegression()
history = lr.fit(x_train,y_train)
y_pred=lr.predict(x_test)
print("Logistic Regression Algorithm performance result: ",lr.score(x_test,y_test))
#print(y_pred)

# Performance

# Confusion Matrix

In [None]:
def roc(y_test,y_score):
    from sklearn.preprocessing import label_binarize
    from sklearn.metrics import roc_curve, auc
    y_test = label_binarize(y_test, classes=[1,2,3])
    y_score = label_binarize(y_score, classes=[1,2,3])
    n_classes = 3
    fpr = dict()
    tpr = dict()
    thr = dict()
    roc_auc = dict()
    for i in range(n_classes):
        fpr[i], tpr[i], thr[i] = roc_curve(y_test[:, i], y_score[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])
    return roc_auc[2],fpr[2],tpr[2],thr[2]

def precision_compute(class_id,TP, FP, TN, FN):
    sonuc=0
    
    for i in range(0,len(class_id)):
        # a class that was never predicted has no defined precision; count it as 0
        if TP[i]+FP[i]==0:
            continue
        sonuc+=(TP[i]/(TP[i]+FP[i]))
        
    sonuc=sonuc/len(class_id)
    return sonuc

def recall_compute(class_id,TP, FP, TN, FN):
    sonuc=0
    for i in range(0,len(class_id)):
        sonuc+=(TP[i]/(TP[i]+FN[i]))
       
    sonuc=sonuc/len(class_id)
    return sonuc
def accuracy_compute(class_id,TP, FP, TN, FN):
    sonuc=0
    for i in range(0,len(class_id)):
        sonuc+=((TP[i]+TN[i])/(TP[i]+FP[i]+TN[i]+FN[i]))
        
    sonuc=sonuc/len(class_id)
    return sonuc
def specificity_compute(class_id,TP, FP, TN, FN):
    sonuc=0
    for i in range(0,len(class_id)):
        sonuc+=(TN[i]/(FP[i]+TN[i]))
        
    sonuc=sonuc/len(class_id)
    return sonuc
def NPV_compute(class_id,TP, FP, TN, FN):
    sonuc=0
    for i in range(0,len(class_id)):
        sonuc+=(TN[i]/(TN[i]+FN[i]))
        
    sonuc=sonuc/len(class_id)
    return sonuc
def perf_measure(y_actual, y_pred):
    # sorted so the per-class counts follow a stable label order
    class_id = sorted(set(y_actual).union(set(y_pred)))
    TP = []
    FP = []
    TN = []
    FN = []

    # one-vs-rest counts: each class in turn is treated as the positive class
    for index ,_id in enumerate(class_id):
        TP.append(0)
        FP.append(0)
        TN.append(0)
        FN.append(0)
        for i in range(len(y_pred)):
            if y_actual[i] == _id and y_pred[i] == _id:
                TP[index] += 1
            if y_pred[i] == _id and y_actual[i] != _id:
                FP[index] += 1
            if y_actual[i] != _id and y_pred[i] != _id:
                TN[index] += 1
            if y_actual[i] == _id and y_pred[i] != _id:
                FN[index] += 1


    return class_id,TP, FP, TN, FN
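
The one-vs-rest counts can also be read directly off the multiclass confusion matrix, which is a convenient cross-check for hand-written counting loops. The sketch below is an illustration with made-up toy labels; `one_vs_rest_counts` is a hypothetical helper name.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def one_vs_rest_counts(y_true, y_pred, labels):
    """Per-class TP, FP, FN, TN derived from the multiclass confusion matrix."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)  # rows: actual, columns: predicted
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as the class, actually another class
    fn = cm.sum(axis=1) - tp  # actually the class, predicted as another class
    tn = cm.sum() - tp - fp - fn  # everything that does not involve the class
    return tp, fp, fn, tn


# Toy labels, made up for illustration only.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
tp, fp, fn, tn = one_vs_rest_counts(y_true, y_pred, labels=[1, 2, 3])
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
```

For every class, the four counts add up to the total number of samples, which is a quick sanity check on any implementation.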

In [None]:
score_liste=[]
auc_scor=[]
precision_scor=[]
recall_scor=[]
f1_scor=[]
LR_plus=[]
LR_eksi=[]
odd_scor=[]
NPV_scor=[]
youden_scor=[]
specificity_scor=[]
lrcauc,lrc_fpr,lrc_tpr,lrc_trr=roc(y_test,y_pred)
# perf_measure returns the counts in the order TP, FP, TN, FN
classid,tp,fp,tn,fn=perf_measure(y_test,y_pred)
auc_scor.append(lrcauc)

score_liste.append(accuracy_compute(classid,tp,fp,tn,fn))
precision_scor.append(precision_compute(classid,tp,fp,tn,fn))
recall_scor.append(recall_compute(classid,tp,fp,tn,fn))
f1_scor.append(f1_score(y_test,y_pred,average='macro'))
NPV_scor.append(NPV_compute(classid,tp,fp,tn,fn))
specificity_scor.append(specificity_compute(classid,tp,fp,tn,fn))
TPR=recall_compute(classid,tp,fp,tn,fn)
TNR=specificity_compute(classid,tp,fp,tn,fn)
FPR=1-TNR
if FPR==0:
    FPR=0.00001
FNR=1-TPR
lreksi=FNR/TNR
lrarti=TPR/FPR
if lreksi==0:
    lreksi=0.00000001
LR_plus.append(TPR/FPR)
LR_eksi.append(FNR/TNR)
odd_scor.append(lrarti/lreksi)
youden_scor.append(TPR+TNR-1)
print("Classification report for the Logistic Regression algorithm: \n",classification_report(y_test,y_pred))

cmlr = confusion_matrix(y_test,y_pred)
f, ax = plt.subplots(figsize =(5,5))
sns.heatmap(cmlr,annot = True,linewidths=0.5,linecolor="red",fmt = ".0f",ax=ax)
plt.xlabel("Estimated")
plt.ylabel("Actual Value")
plt.title("Logistic Regression Algorithm Confusion Matrix")
plt.show()
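
For reference, the derived quantities computed in the cell above (likelihood ratios, odds ratio, and Youden's index) follow these standard definitions, where TPR and TNR are the class-averaged recall and specificity:

$$
\begin{aligned}
\mathrm{FPR} &= 1 - \mathrm{TNR}, \qquad \mathrm{FNR} = 1 - \mathrm{TPR} \\
LR^{+} &= \frac{\mathrm{TPR}}{\mathrm{FPR}}, \qquad LR^{-} = \frac{\mathrm{FNR}}{\mathrm{TNR}} \\
\mathrm{DOR} &= \frac{LR^{+}}{LR^{-}} \\
J &= \mathrm{TPR} + \mathrm{TNR} - 1
\end{aligned}
$$

As an illustration with made-up numbers (not results from this notebook), TPR = 0.90 and TNR = 0.95 give FPR = 0.05 and FNR = 0.10, hence LR+ = 18, LR- ≈ 0.105, DOR = 171, and J = 0.85.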

In [None]:
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

# Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtc=DecisionTreeClassifier()
dtc.fit(x_train,y_train)
y_pred_decisiontree=dtc.predict(x_test)
print("Performance result for Decision Trees Algorithm: ",dtc.score(x_test,y_test))
dtcauc,dtc_fpr,dtc_tpr,dtc_trr=roc(y_test,y_pred_decisiontree)
# perf_measure returns the counts in the order TP, FP, TN, FN
classid,tp,fp,tn,fn=perf_measure(y_test,y_pred_decisiontree)
auc_scor.append(dtcauc)

score_liste.append(accuracy_compute(classid,tp,fp,tn,fn))
precision_scor.append(precision_compute(classid,tp,fp,tn,fn))
recall_scor.append(recall_compute(classid,tp,fp,tn,fn))
# use the decision tree predictions here, not the logistic regression ones
f1_scor.append(f1_score(y_test,y_pred_decisiontree,average='macro'))
NPV_scor.append(NPV_compute(classid,tp,fp,tn,fn))
specificity_scor.append(specificity_compute(classid,tp,fp,tn,fn))
TPR=recall_compute(classid,tp,fp,tn,fn)
TNR=specificity_compute(classid,tp,fp,tn,fn)
FPR=1-TNR
if FPR==0:
    FPR=0.00001
FNR=1-TPR
lreksi=FNR/TNR
lrarti=TPR/FPR
if lreksi==0:
    lreksi=0.00000001
LR_plus.append(TPR/FPR)
LR_eksi.append(FNR/TNR)
odd_scor.append(lrarti/lreksi)
youden_scor.append(TPR+TNR-1)

print("Classification report for Decision Tree algorithm: \n",classification_report(y_test,y_pred_decisiontree))

cmdtc = confusion_matrix(y_test,y_pred_decisiontree)
f, ax = plt.subplots(figsize =(5,5))
sns.heatmap(cmdtc,annot = True,linewidths=0.5,linecolor="red",fmt = ".0f",ax=ax)
plt.xlabel("Estimated")
plt.ylabel("Actual Value")
plt.title("Decision Trees Algorithm Confusion Matrix")
plt.show()

### Hyper-parameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {
'criterion':['gini','entropy'],
'max_depth':[4,6,8,12]
}
g_dt = GridSearchCV(dtc,param_grid=param_grid,n_jobs=-1,cv=5,scoring='accuracy')
g_dt.fit(x_train,y_train)
print('Best parameters:',g_dt.best_params_)
print('Best CV accuracy:',g_dt.best_score_)
# best_estimator_ is already refit on the full training set with the best parameters
f_dt = g_dt.best_estimator_
y_pred1 = f_dt.predict(x_test)
print('Test Accuracy:',accuracy_score(y_test,y_pred1))


# Comparison of Classifier Performance

In [None]:
algo_list=["Logistic Regression","Decision Tree"]
score={"algo_list":algo_list,"precision":precision_scor,"recall":recall_scor,"f1_score":f1_scor,"AUC":auc_scor,"Specificity":specificity_scor}

In [None]:
df2=pd.DataFrame(score)
df2
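
With only 178 samples, a single 70/30 split can make one classifier look better than another by chance. One way to get a steadier comparison is k-fold cross-validation, sketched below. This is an illustration rather than the notebook's procedure: it assumes scikit-learn's `load_wine` as a stand-in for `Wine.csv`, and the fold count and `random_state` values are arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled UCI wine data stands in for Wine.csv.
X, y = load_wine(return_X_y=True)

# Five stratified folds, shuffled with a fixed seed so the run is repeatable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

candidates = {
    # Scaling lives inside the pipeline, so it is refit on each training fold.
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
}

cv_scores = {
    name: cross_val_score(estimator, X, y, cv=cv, scoring="accuracy")
    for name, estimator in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Comparing the mean and spread across folds gives a sense of whether a difference between the two classifiers is larger than the fold-to-fold variation.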

# ROC Curve

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score, classification_report
pred_prob1 = lr.predict_proba(x_test)
pred_prob2 = dtc.predict_proba(x_test)

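# roc curve for models: segment 1 against the rest
# predict_proba columns follow classes_ (1, 2, 3), so column 0 holds P(segment 1)
fpr1, tpr1, thresh1 = roc_curve(y_test, pred_prob1[:,0], pos_label=1)
fpr2, tpr2, thresh2 = roc_curve(y_test, pred_prob2[:,0], pos_label=1)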

# roc curve for tpr = fpr 
random_probs = [0 for i in range(len(y_test))]
p_fpr, p_tpr, _ = roc_curve(y_test, random_probs, pos_label=1)


# plot roc curves
plt.plot(fpr1, tpr1, linestyle='--',color='orange', label='Logistic Regression')
plt.plot(fpr2, tpr2, linestyle='--',color='green', label='Decision tree')
plt.plot(p_fpr, p_tpr, linestyle='--', color='blue')
# title
plt.title('ROC curve')
# x label
plt.xlabel('False Positive Rate')
# y label
plt.ylabel('True Positive rate')

plt.legend(loc='best')
plt.savefig('ROC',dpi=300)
plt.show();
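
The plot above covers a single segment against the rest. To summarize all three segments, one option is to compute a one-vs-rest ROC curve and AUC per class from the predicted probabilities, then macro-average them. The sketch below assumes scikit-learn's `load_wine` as a stand-in for `Wine.csv` (its labels are 0, 1, 2 rather than 1, 2, 3) and is not the notebook's own evaluation.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, label_binarize

# Assumption: the bundled UCI wine data stands in for Wine.csv.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)  # column j belongs to clf.classes_[j]
y_bin = label_binarize(y_test, classes=clf.classes_)  # one indicator column per class

curves = {}
per_class_auc = {}
for j, label in enumerate(clf.classes_):
    # Treat class `label` as positive and the other two classes as negative.
    fpr, tpr, _ = roc_curve(y_bin[:, j], proba[:, j])
    curves[int(label)] = (fpr, tpr)
    per_class_auc[int(label)] = roc_auc_score(y_bin[:, j], proba[:, j])

# Macro average over the three one-vs-rest problems.
macro_auc = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")
print(per_class_auc, macro_auc)
```

Each entry of `curves` can be passed to `plt.plot` to draw one line per segment on the same axes.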

# Conclusion

1. We performed exploratory analysis to gain insights about the data.
2. We performed data pre-processing; the dataset contains no empty or null values, so we used the data as is.
3. We applied feature scaling to the independent features; no features were dropped, so all 13 constituents were used for prediction.
4. We plotted the feature correlations in a heat map.
5. We built Logistic Regression and Decision Tree models.
6. We computed the confusion matrix and performance metrics, and compared the Logistic Regression and Decision Tree classifiers.
7. We plotted the ROC curves of both classifiers on the test data.

Based on the test results, the Logistic Regression classifier has higher accuracy than the Decision Tree, so we recommend the Logistic Regression approach for identifying the customer segment.