
In this multivariate analysis I will not focus on the distribution of the data; that was covered in Notebook 1 (Univariate Statistical Analysis), linked below.

[Univariate Statistical Analysis](https://www.kaggle.com/ravichaubey1506/univariate-statistical-analysis-on-diabetes)

The notebook you are reading is the second notebook in this series.

The third notebook shows how to make inferences about a population from a sample.

[Inferential Statistics on Diabetes](https://www.kaggle.com/ravichaubey1506/inferential-statistics-on-diabetes)




## What is “multivariate”?

Multivariate data analysis is a set of statistical models that examine patterns in multidimensional data by considering, at once, several data variables. It is an expansion of bivariate data analysis, which considers only two variables in its models. As multivariate models consider more variables, they can examine more complex phenomena and find data patterns that more accurately represent the real world.

In [None]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')

In [None]:
df = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
df.head()

**I am converting Outcome to Diab for diabetic patients and Non-Diab for non-diabetic patients. I will also rename the column DiabetesPedigreeFunction to DPF.**

In [None]:
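# Map the numeric target to readable labels.
df.Outcome = df.Outcome.replace({0:'Non-Diab',1:'Diab'})
# rename(..., inplace=True) returns None, so call it on its own instead of assigning the result.
df.rename(columns={'DiabetesPedigreeFunction':'DPF'},inplace = True)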
df.head()

In [None]:
df.dtypes

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe().T

# Basic Summary

The data comes from the healthcare domain and has 768 observations of 9 variables; the target variable is Outcome. `df.info()` reports no null values. The predictors are integers and floats, and Outcome is now a string label. The descriptive statistics show that Glucose, BloodPressure, SkinThickness, Insulin and BMI have a minimum value of 0, which does not make sense physiologically; these zeros are either missing values or outliers. I am not going to alter them, so that I can see the actual statistics of the data. The minimum of Pregnancies is 0 (presumably no pregnancy), which is reasonable. The maximum is 17; the column counts the number of pregnancies rather than months, so this value is unusually high but not impossible. The predictors are on very different scales, so scaling the data will be helpful for predictive modelling.
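
The zero values mentioned above can be counted per column before deciding how to treat them. The following is a minimal sketch on a small made-up frame (not rows from the dataset); `count_zeros` and `ZERO_AS_MISSING` are hypothetical names introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Columns where a value of 0 is implausible, as listed in the summary above.
ZERO_AS_MISSING = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

def count_zeros(frame, columns=ZERO_AS_MISSING):
    """Return the number of zero entries per column (hypothetical helper)."""
    present = [c for c in columns if c in frame.columns]
    return (frame[present] == 0).sum()

# Tiny made-up frame for illustration only.
toy = pd.DataFrame({
    'Glucose': [148, 0, 183],
    'BloodPressure': [72, 66, 0],
    'SkinThickness': [35, 0, 0],
    'Insulin': [0, 0, 0],
    'BMI': [33.6, 26.6, 23.3],
})
zero_counts = count_zeros(toy)
print(zero_counts)

# Optional: view the zeros as missing values on a copy, leaving the original untouched.
toy_nan = toy.copy()
toy_nan[ZERO_AS_MISSING] = toy_nan[ZERO_AS_MISSING].replace(0, np.nan)
print(toy_nan.isna().sum())
```

On the real data, `count_zeros(df)` would give the same kind of per-column count; working on a copy keeps the original statistics intact, which is the choice made in this notebook.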

# Pairplot

**Let us take a closer look at the data with a quick visualization.**

In [None]:
# pairplot creates its own figure, so a separate plt.figure() call would only add an empty one.
sns.pairplot(df)
plt.show()

### Summary ==>

Pregnancies, Insulin, DPF and Age have skewed distributions. Some models assume approximately normal inputs, so these variables may benefit from a transformation; scaling alone changes the range of a variable but not its shape. Note that the Central Limit Theorem concerns the distribution of sample means, not of the raw observations, so a large number of observations does not by itself make a skewed variable bell shaped. Removing outliers may also bring these variables closer to a normal distribution.

It looks like Glucose, BP and BMI variables have some outliers.

Variables are not correlated strongly with each other. I will plot a correlation matrix later.
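
The skew noted in this summary can be quantified with pandas' `Series.skew()`. The sketch below uses a synthetic right-skewed sample (an assumption standing in for a column such as Insulin, not real data) to show that standardizing leaves the skewness unchanged, while a log transform reduces it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed sample; not taken from the diabetes dataset.
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=1000))

# Standardization is a linear rescaling: it changes location and scale only.
scaled = (skewed - skewed.mean()) / skewed.std()
# log1p compresses the long right tail, which changes the shape.
logged = np.log1p(skewed)

print("raw skew:   ", skewed.skew())
print("scaled skew:", scaled.skew())
print("log skew:   ", logged.skew())
```

The same comparison could be run on `df['Insulin']` or `df['DPF']`; whether a log transform is appropriate there depends on how the zero values are handled first.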

#### Let us plot the pairplot by Outcome

In [None]:
# pairplot builds its own figure and adds the hue legend itself;
# a manual plt.legend() call here could attach mismatched labels.
sns.pairplot(df,hue = 'Outcome',palette = 'plasma')
plt.show()

### Summary ==>

We can clearly see that the data points are not linearly separable by Outcome. <font color = 'blue'>Most variables are roughly bell shaped</font>, but some are skewed to the right because of outliers; treating the outliers may reduce the skew. Because the classes overlap in a non-linear way, tree-based models or an SVC with a non-linear decision boundary may give better accuracy.

# Correlation

In [None]:
plt.figure(dpi = 120,figsize= (5,4))
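# Correlation is only defined for numeric columns, so leave out the string-valued Outcome.
corr = df.select_dtypes('number').corr()
mask = np.triu(np.ones_like(corr,dtype = bool))
sns.heatmap(corr,mask = mask, fmt = ".2f",annot=True,lw=1,cmap = 'plasma')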
plt.yticks(rotation = 0)
plt.xticks(rotation = 90)
plt.title('Correlation Heatmap')
plt.show()

<font color = 'blue'>**Nice, the variables are not strongly linearly associated.**</font>
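
A heatmap is easy to scan, but a ranked list makes the strongest pairs explicit. The following is a minimal sketch on a small made-up frame; `ranked_pairs` is a hypothetical helper, and the toy values are illustrative, not from the dataset.

```python
from itertools import combinations

import pandas as pd

def ranked_pairs(frame):
    """List each pair of numeric columns once, sorted by absolute correlation."""
    corr = frame.select_dtypes('number').corr()
    pairs = pd.Series({
        (a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)
    })
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Made-up frame; the string column is ignored by select_dtypes.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 3, 4, 1, 2],
    'label': ['x', 'y', 'x', 'y', 'x'],
})
pairs = ranked_pairs(toy)
print(pairs)
```

Calling `ranked_pairs(df)` on the notebook's data frame would list the same numbers as the heatmap, strongest first.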

# Jointplots

In [None]:
print("Joint plot of Glucose with Other Variables ==> \n")
corr = df.select_dtypes('number').corr()  # numeric columns only; Outcome holds strings
for i in  df.columns:
    if i != 'Glucose' and i != 'Outcome':
        print(f"Correlation between Glucose and {i} ==> ",corr.loc['Glucose', i])
        # jointplot creates its own figure; the regression variant is kind='reg'
        sns.jointplot(x='Glucose',y=i,data=df,kind = 'reg',color = 'purple')
        plt.show()

## Insights

<font color='blue'>Glucose shows a weak positive linear association with the other variables in the dataset.</font> That is, patients with higher Glucose levels tend to have slightly higher values of the other variables; this is an association, not a causal effect. Weak linear association among predictors is helpful, because it reduces the risk of multicollinearity in predictive modelling.
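
One common way to check for multicollinearity beyond pairwise correlations is the variance inflation factor (VIF), which is not used in the original analysis and is shown here only as an optional sketch. It relies on the identity that the VIF of each variable equals the corresponding diagonal entry of the inverse correlation matrix. The data below is synthetic and `vif_from_correlation` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def vif_from_correlation(frame):
    """VIF per numeric column: diagonal of the inverse correlation matrix."""
    numeric = frame.select_dtypes('number')
    corr = numeric.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=numeric.columns)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)                    # independent of x1
x3 = x1 + rng.normal(scale=0.1, size=500)    # nearly a copy of x1

toy = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
vif = vif_from_correlation(toy)
print(vif.round(2))
```

Here `x1` and `x3` get large values because each is almost determined by the other, while `x2` stays close to 1. With weakly correlated predictors, as observed in this notebook, values near 1 would be expected.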

In [None]:
col = list(df.columns)
idx = col.index('BloodPressure')

print("Joint plot of BloodPressure with Other Variables ==> \n")
corr = df.select_dtypes('number').corr()  # numeric columns only; Outcome holds strings
for i in  range(idx+1,len(col)-1):
    print(f"Correlation between BloodPressure and {col[i]} ==> ",corr.loc['BloodPressure', col[i]])
    # jointplot creates its own figure; the regression variant is kind='reg'
    sns.jointplot(x='BloodPressure',y=col[i],data=df,kind = 'reg',color = 'green')
    plt.show()

## Insights

<font color='blue'>BloodPressure shows a weak positive linear association with the other variables in the dataset.</font> That is, patients with higher BloodPressure tend to have slightly higher values of the other variables.

In [None]:
col = list(df.columns)
idx = col.index('SkinThickness')

print("Joint plot of SkinThickness with Other Variables ==> \n")
corr = df.select_dtypes('number').corr()  # numeric columns only; Outcome holds strings
for i in  range(idx+1,len(col)-1):
    print(f"Correlation between SkinThickness and {col[i]} ==> ",corr.loc['SkinThickness', col[i]])
    # jointplot creates its own figure; the regression variant is kind='reg'
    sns.jointplot(x='SkinThickness',y=col[i],data=df,kind = 'reg',color = 'blue')
    plt.show()

## Insights

<font color='blue'>SkinThickness shows a weak positive linear association with the other variables in the dataset<font color = 'red'> (except with Age)</font>.</font> That is, patients with higher SkinThickness tend to have slightly higher values of those variables. <font color = 'blue'>SkinThickness and Age show a weak negative correlation, which means patients with higher SkinThickness tend to be slightly younger.</font>

In [None]:
col = list(df.columns)
idx = col.index('Insulin')

print("Joint plot of Insulin with Other Variables ==> \n")
corr = df.select_dtypes('number').corr()  # numeric columns only; Outcome holds strings
for i in  range(idx+1,len(col)-1):
    print(f"Correlation between Insulin and {col[i]} ==> ",corr.loc['Insulin', col[i]])
    # jointplot creates its own figure; the regression variant is kind='reg'
    sns.jointplot(x='Insulin',y=col[i],data=df,kind = 'reg',color = 'green')
    plt.show()

## Insights

<font color='blue'>Insulin shows a weak positive linear association with the other variables in the dataset<font color = 'red'> (except with Age)</font>.</font> That is, patients with higher Insulin levels tend to have slightly higher values of those variables. <font color = 'blue'>Insulin and Age show a weak negative correlation, which means patients with higher Insulin levels tend to be slightly younger.</font>

In [None]:
col = list(df.columns)
idx = col.index('BMI')

print("Joint plot of BMI with Other Variables ==> \n")
corr = df.select_dtypes('number').corr()  # numeric columns only; Outcome holds strings
for i in  range(idx+1,len(col)-1):
    print(f"Correlation between BMI and {col[i]} ==> ",corr.loc['BMI', col[i]])
    # jointplot creates its own figure; the regression variant is kind='reg'
    sns.jointplot(x='BMI',y=col[i],data=df,kind = 'reg',color = 'green')
    plt.show()

## Insights

<font color='blue'>BMI shows a weak positive linear association with the other variables in the dataset.</font> That is, patients with higher BMI tend to have slightly higher values of the other variables.

In [None]:
col = list(df.columns)
idx = col.index('DPF')

print("Joint plot of DPF with Other Variables ==> \n")
corr = df.select_dtypes('number').corr()  # numeric columns only; Outcome holds strings
for i in  range(idx+1,len(col)-1):
    print(f"Correlation between DPF and {col[i]} ==> ",corr.loc['DPF', col[i]])
    # jointplot creates its own figure; the regression variant is kind='reg'
    sns.jointplot(x='DPF',y=col[i],data=df,kind = 'reg',color = 'red')
    plt.show()

## Insights

<font color='blue'>DPF shows a weak positive linear association with the other variables in the dataset.</font> That is, patients with higher DPF tend to have slightly higher values of the other variables.

# Outcome

Let us see how the data behaves with respect to the target variable, using PCA.

In [None]:
x= df.iloc[:,:-1].values
y= df.iloc[:,-1].values

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(x)

x_new = pca.transform(x)

# Plot the projected components, not the first two raw columns of x.
xs = x_new[:,0]
ys = x_new[:,1]

plt.figure(dpi=100)
sns.scatterplot(x=xs,y=ys,hue=y).set_title('Dependency of Data with Outcome')
plt.xlabel('PCA Feature 1')
plt.ylabel('PCA Feature 2')
plt.show()

**Fitting a linear model to this data is unlikely to give good accuracy because the data points are not linearly separable. Perhaps in higher dimensions we can see some more detail, which I cover in my 4th notebook. For now, fitting a tree-based model or a neural network should help us achieve better accuracy.**
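
The PCA above is fit on the raw feature values. Because PCA looks for directions of maximum variance, a feature measured on a large scale can dominate the components. The sketch below illustrates this on synthetic data (three independent features with very different spreads, not the diabetes data), using scikit-learn's `StandardScaler` as one possible way to standardize before PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three independent synthetic features with very different spreads.
X = np.column_stack([
    rng.normal(0, 100, 500),   # large scale
    rng.normal(0, 1, 500),     # medium scale
    rng.normal(0, 0.3, 500),   # small scale
])

# PCA on raw values: the large-scale feature dominates the first component.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA after standardizing each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("raw:   ", raw_ratio.round(4))
print("scaled:", scaled_ratio.round(4))
```

Applied to this notebook, that would mean fitting the scaler on `x` and passing the scaled array to `pca.fit`; whether the classes look more separable afterwards would need to be checked on the actual data.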

