In [1]:
import pandas as pd 
import numpy as np  
import seaborn as sns 
import matplotlib.pyplot as plt 
%matplotlib inline 
  
import warnings 
warnings.filterwarnings('ignore')

### 1) Read and View

- Head and Tail
- Shape
- Dtypes
- Describe
- Info

In [4]:
# df = pd.read_csv()
# df.head()
# df.tail()
# df.shape
# df.dtypes
# df.describe()
# df.info()
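As a minimal sketch of the calls above, the following reads a small in-memory CSV instead of a file on disk. The `Glucose` and `BMI` columns and all values are made up for illustration; only `Outcome` appears elsewhere in this notebook.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """Glucose,BMI,Outcome
148,33.6,1
85,26.6,0
183,23.3,1
89,28.1,0
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))     # first two rows
print(df.tail(2))     # last two rows
print(df.shape)       # (rows, columns); an attribute, so no parentheses
print(df.dtypes)      # dtype of each column; also an attribute
print(df.describe())  # summary statistics for the numeric columns
df.info()             # column names, non-null counts, and dtypes
```

Note that `shape` and `dtypes` are attributes, while `head`, `tail`, `describe`, and `info` are methods.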

### 2) Data Preparation
This step cleans the dataset so it is ready for analysis, for example by removing irrelevant fields, handling missing values, and treating outliers.

- Dropping irrelevant rows and columns
- Renaming columns (if needed)
- Dropping Null values
- Identifying duplicated columns
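
The steps in this list could look roughly like the sketch below. The DataFrame, its column names (`id`, `Glu`, `BMI_copy`), and its values are hypothetical, and comparing transposed rows with `df.T.duplicated()` is only one of several ways to find duplicated columns.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: an ID column we do not need, a missing value,
# a repeated row, and a column that duplicates another one.
raw = pd.DataFrame({
    "id": [1, 2, 3, 3],
    "Glu": [148.0, np.nan, 183.0, 183.0],
    "BMI": [33.6, 26.6, 23.3, 23.3],
    "BMI_copy": [33.6, 26.6, 23.3, 23.3],
})

df = raw.drop(columns=["id"])               # drop an irrelevant column
df = df.rename(columns={"Glu": "Glucose"})  # rename for readability
df = df.dropna()                            # drop rows containing null values
df = df.drop_duplicates()                   # drop repeated rows

# Columns whose values repeat an earlier column.
dup_cols = df.columns[df.T.duplicated()].tolist()
df = df.drop(columns=dup_cols)
```

Transposing is fine for a small frame like this one; on wide or mixed-type data a column-by-column comparison may be cheaper.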

In [6]:
# df[cols].hist(figsize=(11, 5), color="#800080")     # look at the distributions
# df[cols].boxplot(figsize=(11, 5), color="#800080")  # look for outliers



#### Detecting outliers
- Global outliers: Stand out from the entire dataset, like a lone wolf.
- Contextual outliers: Unusual only in a given context, like a sale at a clothing store that is high for an ordinary weekday but normal during a holiday season.
- Collective outliers: Groups that deviate together, like a cluster of oddly high values.

<b>Normal Distribution (Z-score Treatment)</b>
- *Values below mean - 3 × sigma or above mean + 3 × sigma are outliers.*
- Lower bound (mean minus 3 times sigma): $$\mu - 3\sigma$$
- Upper bound (mean plus 3 times sigma): $$\mu + 3\sigma$$
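
A minimal sketch of the three-sigma rule, assuming a roughly normal column. The sample values are made up, and `Series.std()` uses the sample standard deviation (`ddof=1`), which is an assumption about which sigma is intended.

```python
import pandas as pd

# Hypothetical sample: 20 values clustered around 50 plus one extreme value.
values = pd.Series([48, 49, 50, 51, 52] * 4 + [120], dtype=float)

mu = values.mean()
sigma = values.std()    # sample standard deviation (ddof=1)

lower = mu - 3 * sigma  # mean minus 3 times sigma
upper = mu + 3 * sigma  # mean plus 3 times sigma

# Keep only the points that fall outside the two bounds.
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

Because the extreme value itself inflates both the mean and sigma, this rule can miss outliers in very small samples.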


<b>Skewed Distribution (IQR Treatment)</b><br>
*Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are outliers, where Q1 and Q3 are the 25th and 75th percentiles of the dataset, respectively. IQR is the interquartile range, given by Q3 - Q1.*

Here are the expressions for identifying outliers:
- Interquartile range: $$ \text{IQR} = Q3 - Q1 $$
- Lower bound: $$ Q1 - 1.5 \times \text{IQR} $$
- Upper bound: $$ Q3 + 1.5 \times \text{IQR} $$
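
The bounds above can be computed along these lines. The sample values are hypothetical, and `Series.quantile` uses linear interpolation by default, so other quartile conventions may give slightly different bounds.

```python
import pandas as pd

# Hypothetical right-skewed sample with one extreme value.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 30], dtype=float)

q1 = values.quantile(0.25)  # 25th percentile
q3 = values.quantile(0.75)  # 75th percentile
iqr = q3 - q1               # interquartile range

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside [lower, upper] are flagged as outliers.
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())
```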

#### Treating outliers
- Trimming
- Capping
- Imputing
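
One way to sketch the three treatments, using the IQR bounds from the previous section on made-up values. Imputing with the median of the non-outlying points is an assumption here; the mean or a model-based estimate are other common choices.

```python
import pandas as pd

# Hypothetical sample with one extreme value.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 30], dtype=float)

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Trimming: remove the outlying observations entirely.
trimmed = values[~is_outlier]

# Capping: clamp outlying observations to the nearest bound.
capped = values.clip(lower=lower, upper=upper)

# Imputing: replace outlying observations with a typical value
# (here the median of the remaining points).
imputed = values.mask(is_outlier, values[~is_outlier].median())
```

Trimming shrinks the dataset, whereas capping and imputing keep every row but alter the flagged values.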

### 3) Exploratory Data Analysis

*Data visualisation refers to the graphical representation of data to communicate complex information in a concise and understandable manner. There are three main types of analysis: univariate, bivariate, and multivariate.*

- <b>Univariate analysis</b> is the most straightforward method: it examines only one variable at a time using descriptive statistics such as the mean, median, mode, standard deviation, and range. The purpose of this analysis is to summarize the data and identify any patterns or trends.

- <b>Bivariate analysis</b> is the study of the relationship between two variables, which can be determined by using correlation analysis, scatter plots, and other statistical methods. The main goal of this analysis is to establish whether there is a connection between the two variables and to comprehend the strength and direction of that connection.

- <b>Multivariate analysis</b>, on the other hand, is a more intricate type of analysis that involves examining the relationships between three or more variables. It is commonly used in fields like finance, marketing, and social science to identify patterns, trends, and relationships not apparent from univariate or bivariate analysis.
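
Before plotting, the same three levels of analysis can be sketched numerically. The DataFrame below is hypothetical (`Glucose` and `BMI` are assumed column names), and Pearson correlation is only one possible measure of association.

```python
import pandas as pd

# Hypothetical data with two features and a binary target.
df = pd.DataFrame({
    "Glucose": [85, 89, 116, 137, 148, 183],
    "BMI": [26.6, 28.1, 25.6, 43.1, 33.6, 23.3],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Univariate: descriptive statistics for a single variable.
glucose_summary = df["Glucose"].agg(["mean", "median", "std", "min", "max"])

# Bivariate: strength and direction of the linear relationship
# between two variables (Pearson correlation).
r = df["Glucose"].corr(df["Outcome"])

# Multivariate: all pairwise correlations at once.
corr_matrix = df.corr()

print(glucose_summary)
print(r)
print(corr_matrix)
```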

In [5]:
# sns.pairplot(df, hue='Outcome', palette=('#FFFF00', '#800080'))  # relationships between variables
# sns.heatmap(df.corr(), annot=True, cmap='PiYG')                  # correlation coefficients between all pairs of variables

### 4) Model Building
*Model building creates a representation of a system for prediction or understanding.*

Descriptive statistics provide a summary of the central tendency, dispersion, and shape of a dataset.




In [7]:
from sklearn.model_selection import train_test_split 
# x = df.drop(['Outcome'],axis=1) 
# y = df['Outcome']

In [8]:
from sklearn.preprocessing import StandardScaler 
  
# sc= StandardScaler() 
# x_scaled= sc.fit_transform(x)

In [9]:
## split the data into train and test.

# x_train, x_test, y_train, y_test = train_test_split( 
#                                     x_scaled, y, 
#                                     test_size=0.3, 
#                                     random_state=0) 
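
The cells above scale the full feature matrix before splitting. A common alternative, sketched here on synthetic data, is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence the transformation. The arrays and the `x_train_scaled` / `x_test_scaled` names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical features and binary target.
rng = np.random.default_rng(0)
x = rng.normal(loc=100.0, scale=15.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Split first, using the same test_size and random_state as above.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)

sc = StandardScaler()
x_train_scaled = sc.fit_transform(x_train)  # learn mean and std from training rows only
x_test_scaled = sc.transform(x_test)        # reuse the training statistics on the test rows
```

With this ordering the scaled training columns have mean 0 and unit variance, while the test columns are only approximately standardized.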

In [10]:
# x_train.shape, y_train.shape 
# x_test.shape, y_test.shape


In [12]:
from sklearn.linear_model import LogisticRegression 
logreg = LogisticRegression() 
# logreg.fit(x_train, y_train) 
# y_pred = logreg.predict(x_test)

Computing Confusion Matrix


In [13]:
from sklearn.metrics import confusion_matrix, accuracy_score 
# confmat = confusion_matrix(y_test, y_pred)  # true labels first, then predictions
# confmat

In [14]:
from sklearn import metrics 
# cm = metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_test, y_pred, labels=logreg.classes_), 
#                                     display_labels=logreg.classes_) 
# cm.plot(cmap="RdYlBu")  # plot the matrix

Model Accuracy with Logistic Regression

In [15]:
# accuracy_score(y_test, y_pred)
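
To make the link between the confusion matrix and accuracy concrete, here is a small sketch with hand-written labels. `y_true` and `y_hat` are hypothetical stand-ins for `y_test` and `y_pred`. scikit-learn expects the true labels as the first argument, so rows of the matrix are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true labels and predictions for ten samples.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_hat)  # rows = true class, columns = predicted class
tn, fp, fn, tp = cm.ravel()           # binary case: [[tn, fp], [fn, tp]]

acc = accuracy_score(y_true, y_hat)
acc_from_cm = (tn + tp) / cm.sum()    # correct predictions / all predictions

print(cm)
print(acc, acc_from_cm)
```

Swapping the two arguments transposes the matrix (false positives and false negatives trade places) but leaves the accuracy unchanged.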