
Dataset: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database?resource=download

The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.



# Importing pandas and reading our data

Pandas is a Python library used for working with data sets. It has functions for analyzing, cleaning, exploring, and manipulating data.

In [None]:
import pandas as pd

df = pd.read_csv('diabetes.csv')
print(df.head())

In [None]:
print(df.tail(7))

Extracting a list with the names of the columns

In [None]:
columns = list(df.columns.values)
print(columns)

In [None]:
df.info()

# Indexing and Slicing

In [None]:
# accessing the row with index 2
df.iloc[2]

In [None]:
# accessing first 5 rows and first 3 columns
df.iloc[0:5, 0:3]

In [None]:
# index of entry and BMI value
for idx, row in df.iterrows():
    print(idx, row["BMI"])

# Filter Data

In [None]:
# people with a BMI lower than 20
df.loc[df['BMI']<20]

##### iloc (Integer-position-based Accessor):
It's used for integer-position-based selection of rows and columns.
##### loc (Label-based Accessor):
It's used for label-based selection of rows and columns; it also accepts boolean masks, as in the filter above.
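
A minimal sketch of how the two accessors differ, using a hypothetical toy DataFrame (the values and index labels are made up for illustration). `iloc` selects by integer position, `loc` selects by index label, and the two treat slice endpoints differently:

```python
import pandas as pd

# Hypothetical toy data; the index labels deliberately differ from row positions.
toy = pd.DataFrame(
    {"BMI": [33.6, 26.6, 23.3], "Age": [50, 31, 32]},
    index=[10, 20, 30],
)

by_position = toy.iloc[0]   # first row, whatever its label is
by_label = toy.loc[10]      # row whose index label is 10

# With a default 0..n-1 index the two look alike, but slices still differ:
# iloc excludes the end position, loc includes the end label.
default = toy.reset_index(drop=True)
rows_iloc = len(default.iloc[0:2])   # positions 0 and 1
rows_loc = len(default.loc[0:2])     # labels 0, 1 and 2

print(by_position.equals(by_label))
print(rows_iloc, rows_loc)
```

This is why `df.iloc[0:5]` returns five rows while `df.loc[0:5]` on the default index returns six.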

# Condition-based filtering

In [None]:
df.loc[df['Pregnancies']>12]

In [None]:
filtered_df_neg = df.loc[((df['BloodPressure']>80) | (df['BMI']>=40)) & (df['Outcome']==0) ]
filtered_df_pos = df.loc[((df['BloodPressure']>80) | (df['BMI']>=40)) & (df['Outcome']==1) ]

print("No. of people with diabetes and either obesity or high blood pressure:", len(filtered_df_pos))
print("No. of people without diabetes but with either obesity or high blood pressure:", len(filtered_df_neg))

In [None]:
#generate descriptive statistics
df.describe()

In [None]:
#changing columns names
df.rename(columns = {'DiabetesPedigreeFunction':'GeneticProbabilityOfDiabetes'}, inplace = True)

# Sorting

In [None]:
# printing Glucose and Insulin values sorted ascending by Glucose
print(df[['Glucose', 'Insulin']].sort_values('Glucose'))

In [None]:
df.sort_values("Age", ascending=False)

In [None]:
# sorting data ascending by age and descending by Glucose
# in order to quickly spot the youngest people with the highest level of glucose
df.sort_values(["Age", "Glucose"], ascending=[True, False])

##### Groupby
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
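
The split, apply, and combine steps can be sketched on a hypothetical toy DataFrame (the values are made up for illustration). The hand-written loop is only there to show what `groupby(...).mean()` does in one call:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Age": [21, 21, 35, 35, 35],
    "Glucose": [90, 110, 120, 150, 180],
})

# Split by Age, apply mean to each group, combine into one Series.
mean_glucose = toy.groupby("Age")["Glucose"].mean()

# The same three steps written out by hand, for comparison.
manual = {age: group["Glucose"].mean() for age, group in toy.groupby("Age")}

print(mean_glucose)
print(manual)
```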

In [None]:
mean_values_age = df.groupby(['Age']).mean()
#mean_values_age = mean_values_age.drop(columns=['Outcome'])
mean_values_age = mean_values_age.drop('Outcome', axis=1)
mean_values_age.sort_values('GeneticProbabilityOfDiabetes', ascending=False)

In [None]:
# saving the Outcome values
target = df["Outcome"]
target

In [None]:
# eliminating columns
df.drop(columns=["Outcome"])

In [None]:
df

In [None]:
# drop() returns a new DataFrame, so we have to assign the result back to df to apply the change
df=df.drop(columns=["Outcome"])

In [None]:
df

In [None]:
df=df.join(target)
#df = pd.concat([df, target], axis=1)

In [None]:
df

# Changing values based on conditions

In [None]:
df.loc[(df['Glucose']>150) & (df['Outcome']==0) & (df['BMI']>=40)]

In [None]:
# If the glucose and BMI values are high and that person is not marked as sick there is most likely a problem.
# We extract the indexes of those rows and modify the Outcome value to 1
rows_to_modify = list(df.loc[(df['Glucose']>150) & (df['Outcome']==0) & (df['BMI']>=40)].index.values)
rows_to_modify

In [None]:
df.loc[(df['Glucose']>150) & (df['Outcome']==0) & (df['BMI']>=40), 'Outcome'] = 1
df.loc[rows_to_modify]

In [None]:
selection_condition = df['Age'].isin([21,23])
df[selection_condition]

# Saving Files

In [None]:
df.to_csv("diabetesIndexed.csv")

In [None]:
df.to_csv("diabetesNotIndexed.csv", index=False)

In [None]:
df.to_excel("diabetesExcel.xlsx", index=False)

In [None]:
df2 = pd.read_excel("diabetesExcel.xlsx", index_col=False)

In [None]:
df2

# Visualization

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Bar plots

It's important to check for Data Imbalance. For classification tasks, check the distribution of the target variable to identify if the classes are imbalanced using .value_counts() or bar plots.
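
Besides absolute counts, class shares make the imbalance easier to read. A minimal sketch on hypothetical labels (not the real dataset), using `value_counts(normalize=True)`; the `imbalance_ratio` name is just an illustrative helper:

```python
import pandas as pd

# Hypothetical labels: 0 = negative class, 1 = positive class.
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Outcome")

counts = outcome.value_counts()                # absolute count per class
shares = outcome.value_counts(normalize=True)  # fraction of rows per class

# Ratio of majority to minority class; 1.0 would mean perfectly balanced.
imbalance_ratio = counts.max() / counts.min()

print(counts.to_dict())
print(shares.to_dict())
print(imbalance_ratio)
```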

In [None]:
df = pd.read_csv('diabetes.csv')

In [None]:
diabetes_counts = df['Outcome'].value_counts()
plt.figure(figsize=(6, 4))
diabetes_counts.plot(kind='bar', color=['blue', 'red'])
plt.title('Data Imbalance')
plt.ylabel('Number of People')
plt.xlabel('Outcome')

plt.show()
# or sns.countplot(x='Outcome', data= df)

Method of dealing with imbalanced classes: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

# Box plots
We can use boxplots to detect outliers in numerical data

In [None]:
plt.figure(figsize=(12,12))
for i, col in enumerate(columns):
    plt.subplot(3,3, i+1)  # subplot on which the boxplot for this column is drawn
    sns.boxplot(x=col, data = df)
plt.show()
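
With default settings, the points a boxplot draws beyond the whiskers are those lying more than 1.5 times the interquartile range (IQR) outside the quartiles. A minimal sketch of the same rule in code, on hypothetical BMI-like values (not taken from the dataset):

```python
import pandas as pd

# Hypothetical values with one obviously extreme entry.
values = pd.Series([22.0, 25.5, 27.1, 30.0, 31.2, 33.6, 35.0, 67.1], name="BMI")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers.tolist())
```

The 1.5 factor is a convention, not a guarantee that a flagged value is wrong; flagged rows still need to be inspected.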

# Histograms
We can plot histograms to understand the distribution of continuous variables. This also helps in detecting outliers.

In [None]:
plt.figure(figsize=(12,12))
for i, col in enumerate(columns):
    plt.subplot(3,3, i+1)
    sns.histplot(x=col, data = df, kde = True)
plt.show()

# Heatmaps
We can use heatmaps to visualize the correlation matrix, which makes strongly correlated pairs or groups of variables easy to spot.

In [None]:
plt.figure(figsize= (12,12))
sns.heatmap(df.corr(),cmap='coolwarm', annot = True)
plt.show()

Correlation coefficients:

1: as one variable increases, the other variable increases proportionally

-1: as one variable increases, the other decreases proportionally

0: no linear relationship between the variables
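
The three cases can be checked on a hypothetical toy DataFrame (columns made up for illustration). The last column shows that a coefficient of 0 only rules out a linear relationship, not a relationship in general:

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["up"] = 2 * toy["x"] + 1           # perfectly linear, increasing
toy["down"] = -3 * toy["x"] + 10       # perfectly linear, decreasing
toy["u_shape"] = (toy["x"] - 3) ** 2   # depends on x, but not linearly

# Pearson correlation (the default of DataFrame.corr)
corr = toy.corr()
print(corr["x"].round(2))
```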

We can observe that the highest positive correlation (0.54) is between Age and Pregnancies, suggesting that as age increases, the number of pregnancies tends to increase as well.

Glucose and Outcome show a positive correlation of 0.47, indicating that higher glucose levels are associated with a higher likelihood of diabetes.

We can also observe a correlation between Insulin and SkinThickness, implying that higher insulin levels tend to be associated with greater skin thickness, which is indeed mentioned in the specialized literature: https://pubmed.ncbi.nlm.nih.gov/2721339/

In [None]:
corr_matrix = df.corr()
corr_matrix

# Scatterplots
A scatter plot identifies a possible relationship between changes observed in two different sets of variables.

In [None]:
plt.figure(figsize=(20, 12))
sns.scatterplot(data=df, x='Glucose', y='BMI', hue='Outcome', style='Outcome', s=100)
plt.title('Scatter Plot of Glucose vs BMI')
plt.xlabel('Glucose Level')
plt.ylabel('BMI')
plt.legend(title='Outcome', loc='upper left', labels=['No Diabetes', 'Diabetes'])
plt.grid(True)
plt.show()

# Missing values

In datasets, missing entries might appear as 0, "NA", "NaN", "NULL", "Not Applicable", or "None". In this dataset, missing data is marked with 0: no one can have a BMI of 0.

In [None]:
missing_values = (df[['SkinThickness', 'BloodPressure', 'BMI']] == 0).sum()
print("Number of missing values:")
print(missing_values)

In [None]:
mean_skin_thickness = df['SkinThickness'][df['SkinThickness'] != 0].mean()
df['SkinThickness'] = df['SkinThickness'].replace(0, mean_skin_thickness)
missing_values = (df[['SkinThickness', 'BloodPressure', 'BMI']] == 0).sum()
print("Number of missing values:")
print(missing_values)
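
An alternative sketch, on a hypothetical toy DataFrame: convert the placeholder zeros to `NaN` first, so that the standard missing-value tools (`isna`, `fillna`) apply. The choice of columns in `zero_means_missing` is an assumption made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using 0 as a placeholder for missing measurements.
toy = pd.DataFrame({
    "BMI": [33.6, 0.0, 23.3],
    "SkinThickness": [35, 0, 0],
    "Pregnancies": [6, 0, 1],
})

# Only columns where 0 is physically impossible; 0 pregnancies is a valid value.
zero_means_missing = ["BMI", "SkinThickness"]
toy[zero_means_missing] = toy[zero_means_missing].replace(0, np.nan)

missing_per_column = toy.isna().sum()
print(missing_per_column)
```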

Mean Imputation: useful when the data is approximately normally distributed, because the mean is then a good representation of the central tendency.

Median Imputation: it's less sensitive to outliers than the mean.

Mode Imputation: useful for categorical data or discrete numerical data. The mode is the value that appears most frequently in a dataset. In the context of mode imputation, it is used to replace missing values with the most common value in a given column.

Forward Fill: It replaces missing values with the last observed non-missing value in the column.

Backward Fill: It replaces missing values with the next observed non-missing value in the column.
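
The five strategies above can be compared side by side on a hypothetical toy Series (values made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
s = pd.Series([10.0, np.nan, 30.0, 30.0, np.nan, 80.0])

mean_filled = s.fillna(s.mean())          # mean of the observed values
median_filled = s.fillna(s.median())      # median of the observed values
mode_filled = s.fillna(s.mode().iloc[0])  # most frequent observed value
forward_filled = s.ffill()                # last observed value carried forward
backward_filled = s.bfill()               # next observed value carried backward

print(pd.DataFrame({
    "mean": mean_filled,
    "median": median_filled,
    "mode": mode_filled,
    "ffill": forward_filled,
    "bfill": backward_filled,
}))
```

Forward and backward fill only make sense when row order carries meaning (for example, time series); in a table of unrelated patients they would copy values between arbitrary neighbours.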


# Categorical Features
Stroke dataset: https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data

In [None]:
stroke_df = pd.read_csv("stroke-data.csv", index_col=0)

In [None]:
stroke_df

In the previous dataset we only had numerical data, but this dataset also has categorical features. Since most machine learning algorithms can only process numerical values, we need to convert these categorical features into appropriate numerical representations. To do this we can use:

Label Encoder: Encode target labels with value between 0 and n_classes-1. ![image.png](attachment:image.png)

One-Hot Encoder: Encode categorical features as a one-hot numeric array.
![image-2.png](attachment:image-2.png)
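
A minimal pandas-only sketch of the two encodings on a hypothetical toy column. Categorical codes are used here as a stand-in for `LabelEncoder`; for string columns they should give the same integers, since both sort the categories alphabetically:

```python
import pandas as pd

# Hypothetical toy column.
toy = pd.DataFrame(
    {"smoking_status": ["smokes", "never smoked", "formerly smoked", "smokes"]}
)

# Label encoding: one integer per category (categories sorted alphabetically).
label_codes = toy["smoking_status"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(toy["smoking_status"], dtype=int)

print(label_codes.tolist())
print(one_hot)
```

Label encoding keeps a single column but introduces an artificial ordering between categories; one-hot encoding avoids that ordering at the cost of one column per category.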

In [None]:
stroke_df.head()

In [None]:
stroke_one_hot_df = pd.get_dummies(stroke_df)
stroke_one_hot_df.head()

In [None]:
categorical_columns = stroke_df.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_columns

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_mappings = {}

for column in categorical_columns:
    stroke_df[column] = label_encoder.fit_transform(stroke_df[column])
    label_mappings[column] = dict(zip(label_encoder.classes_, range(len(label_encoder.classes_))))

print(label_mappings)
stroke_df.head()

# Exercises

1. Determine the target value (the value that we want to predict/classify depending on task) and create a plot to see if the dataset we are using has class imbalance or not.

In [None]:
#TO DO

stroke_counts = stroke_df['stroke'].value_counts()
plt.figure(figsize=(6, 4))
stroke_counts.plot(kind='bar', color=['blue', 'red'])
plt.title('Data Imbalance')
plt.ylabel('Number of People')
plt.xlabel('Has Stroke')

plt.show()


2. Using any graphical representation of your choice, determine if there are correlations between the features. Choose 2 correlations that you identified as relevant and explain them.

In [None]:
#TO DO

plt.figure(figsize= (12,12))
sns.heatmap(stroke_df.corr(),cmap='coolwarm', annot = True)
plt.show()

As we can see, gender and smoking status show almost no correlation, which suggests that neither men nor women in this dataset are more likely to smoke than the other.

3. Count and display how many entries are missing for the BMI feature, and fill in the missing values. Justify your choice of method.

In [None]:
#TO DO
missing_bmi = stroke_df['bmi'].isnull().sum()
print(missing_bmi)

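# The median is less sensitive to outliers than the mean, so it is used for imputation here.
# Assigning the result back avoids relying on inplace=True on a column selection.
stroke_df['bmi'] = stroke_df['bmi'].fillna(stroke_df['bmi'].median())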


4. Determine the number of people who had both heart disease and high blood pressure, and categorize them as either smokers or non-smokers.

In [None]:
#TO DO
filter_df = stroke_df.loc[(stroke_df['heart_disease'] == 1) & (stroke_df['hypertension'] == 1)]


smoker_counts = filter_df.groupby(by=["smoking_status"])['age'].count()
smoker_counts

5. Sort the DataFrame by age in ascending order and by glucose level in descending order. Then, rearrange the column names in alphabetical order, keeping the stroke column as the last one.

In [None]:
#TO DO
# sort_values() returns a new DataFrame, so the result is assigned back
stroke_df = stroke_df.sort_values(["age", "avg_glucose_level"], ascending=[True, False])

cols = sorted(stroke_df.columns.tolist())
cols.remove('stroke')
cols.append('stroke')
stroke_df = stroke_df[cols]
stroke_df


Save the final CSV file.

In [None]:
#TO DO
stroke_df.to_csv("stroke_final.csv", index=False)