
This notebook walks through the Exploratory Data Analysis (EDA) process, with step-by-step Python code examples and an explanation for each part.

1. Measuring Central Tendency
Calculating Mean, Median, and Mode:

In [None]:
import pandas as pd
import numpy as np
from scipy import stats

# Sample DataFrame
df = pd.DataFrame({
    'Value': [1, 2, 2, 3, 4, 6, 6, 8, 9, 10]
})

# Mean
mean = df['Value'].mean()
print("Mean:", mean)

# Median
median = df['Value'].median()
print("Median:", median)

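# Mode (this sample has two modes, 2 and 6; pandas returns every tied value,
# whereas scipy's stats.mode() reports only one of them)
mode = df['Value'].mode().tolist()
print("Mode:", mode)


Explanation: This code calculates the mean, median, and mode for a given dataset. The mean() method returns the arithmetic average, median() returns the middle value of the sorted data, and mode() returns the most frequent value or values. Because 2 and 6 both appear twice in this sample, mode() returns both.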
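As a cross-check, the same three statistics can be computed with Python's standard-library statistics module. This is a minimal sketch that copies the sample values into a plain list called values (a name chosen here for illustration). It uses multimode() because the sample has two equally frequent values.

```python
import statistics

# Same sample values as the DataFrame above, as a plain list
values = [1, 2, 2, 3, 4, 6, 6, 8, 9, 10]

mean_value = statistics.mean(values)       # sum of the values divided by their count
median_value = statistics.median(values)   # average of the two middle values when the count is even
modes = statistics.multimode(values)       # every value tied for most frequent

print("Mean:", mean_value)
print("Median:", median_value)
print("Modes:", modes)
```

When a dataset has several modes, reporting all of them avoids silently hiding a tie.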

2. Measuring Variance and Range
Calculating Variance and Range:

In [None]:
# Variance
variance = df['Value'].var()
print("Variance:", variance)

# Range
value_range = df['Value'].max() - df['Value'].min()
print("Range:", value_range)


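Explanation: This code calculates the variance (the average squared deviation from the mean) and the range (the difference between the highest and lowest values). Note that pandas' var() returns the sample variance by default, dividing by n - 1 rather than n.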
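The choice of divisor matters when comparing results across libraries. The sketch below, which rebuilds the sample as a standalone Series named values (an illustrative name), contrasts the sample variance with the population variance and shows how the standard deviation relates to them.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 2, 3, 4, 6, 6, 8, 9, 10])

# pandas defaults to the sample variance (ddof=1, divide by n - 1)
sample_var = values.var()

# Passing ddof=0 gives the population variance (divide by n)
population_var = values.var(ddof=0)

# NumPy's default is ddof=0, so it matches the population variance
numpy_var = np.var(values.to_numpy())

# The standard deviation is the square root of the variance
sample_std = values.std()

print("Sample variance:", sample_var)
print("Population variance:", population_var)
print("NumPy default variance:", numpy_var)
print("Sample standard deviation:", sample_std)
```

For small samples the two divisors give noticeably different results, so it helps to state which one a reported variance uses.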

3. Working with Percentiles
Calculating Percentiles:

In [None]:
# 25th, 50th, and 75th Percentiles
percentiles = np.percentile(df['Value'], [25, 50, 75])
print("Percentiles (25th, 50th, 75th):", percentiles)


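Explanation: This code calculates the 25th, 50th, and 75th percentiles of the dataset using numpy's percentile() function. When a percentile falls between two data points, numpy interpolates linearly between them by default.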
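To see where the percentile values come from, the linear interpolation can be written out by hand. The helper percentile_linear below is a hypothetical function written for this illustration, not part of numpy. It is a sketch of the default interpolation rule, compared against np.percentile on the same sample.

```python
import math
import numpy as np

def percentile_linear(data, p):
    """Return the p-th percentile (0-100) using linear interpolation."""
    ordered = sorted(data)
    position = (len(ordered) - 1) * p / 100   # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    # Move the fractional distance from the lower neighbour toward the upper one
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

values = [1, 2, 2, 3, 4, 6, 6, 8, 9, 10]

by_hand = [percentile_linear(values, p) for p in (25, 50, 75)]
with_numpy = np.percentile(values, [25, 50, 75])

print("By hand:", by_hand)
print("NumPy:", with_numpy)
```

For example, the 25th percentile sits at fractional index 2.25 of the sorted data, a quarter of the way from 2 to 3, which gives 2.25.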

4. Detecting Outliers
Using Z-scores to Detect Outliers:

In [None]:
z_scores = np.abs(stats.zscore(df['Value']))
outliers = df['Value'][z_scores > 3]
print("Outliers:", outliers)


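Explanation: This code uses Z-scores to detect outliers. Data points whose absolute Z-score is greater than 3, that is, more than three standard deviations from the mean, are considered outliers. None of the sample values is that far from the mean, so the result here is empty.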
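Because the 10-value sample contains no extreme values, the rule is easier to see on data that does. The sketch below uses a hypothetical array called readings, invented for this illustration, and computes the Z-scores directly with numpy using the population standard deviation, which is also the default in scipy's stats.zscore().

```python
import numpy as np

# Hypothetical sample: 19 readings between 10 and 13 plus one extreme value
readings = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
                     12, 10, 11, 13, 12, 11, 10, 12, 11, 100])

# Absolute Z-scores using the population standard deviation (ddof=0)
z_scores = np.abs((readings - readings.mean()) / readings.std())

# Keep only the points more than three standard deviations from the mean
outliers = readings[z_scores > 3]

print("Largest |z|:", z_scores.max())
print("Outliers:", outliers)
```

With this definition the largest possible absolute Z-score in a sample of n values is sqrt(n - 1), so in a 10-value sample no point can exceed a threshold of 3. For small datasets, the IQR method or a lower threshold may be more informative.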

5. Counting for Categorical Data
Counting Frequencies for Categorical Data:

In [None]:
# Sample DataFrame with categorical data
df_cat = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'C', 'A', 'B', 'C', 'C', 'B', 'A']
})

# Frequency count
frequency_counts = df_cat['Category'].value_counts()
print("Frequency Counts:")
print(frequency_counts)


Explanation: This code counts the frequency of each category in a categorical dataset using value_counts().
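Raw counts are often easier to interpret alongside relative frequencies. As a small sketch, value_counts() accepts normalize=True to return proportions instead of counts. The names categories and summary are chosen here for illustration.

```python
import pandas as pd

categories = pd.Series(['A', 'B', 'A', 'C', 'A', 'B', 'C', 'C', 'B', 'A'])

# Absolute frequency of each category
counts = categories.value_counts()

# Relative frequency: each count divided by the total number of rows
proportions = categories.value_counts(normalize=True)

# Combine both views into one table, aligned on the category labels
summary = pd.DataFrame({'count': counts, 'proportion': proportions})
print(summary)
```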

6. Creating Contingency Tables
Creating a Contingency Table:

In [None]:
# Sample DataFrame with two categorical columns
df_contingency = pd.DataFrame({
    'Category1': ['A', 'B', 'A', 'C', 'A'],
    'Category2': ['X', 'Y', 'X', 'Y', 'X']
})

# Contingency table
contingency_table = pd.crosstab(df_contingency['Category1'], df_contingency['Category2'])
print("Contingency Table:")
print(contingency_table)


Explanation: This code creates a contingency table using crosstab() to show the frequency distribution between two categorical variables.
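Two optional crosstab() parameters are often useful in EDA: margins=True adds row and column totals, and normalize='index' converts each row to proportions. The sketch below applies both to the same sample. The variable names with_totals and row_shares are illustrative.

```python
import pandas as pd

df_contingency = pd.DataFrame({
    'Category1': ['A', 'B', 'A', 'C', 'A'],
    'Category2': ['X', 'Y', 'X', 'Y', 'X']
})

# Counts with an extra "All" row and column holding the totals
with_totals = pd.crosstab(df_contingency['Category1'],
                          df_contingency['Category2'],
                          margins=True)

# Share of each Category2 value within every Category1 row (rows sum to 1)
row_shares = pd.crosstab(df_contingency['Category1'],
                         df_contingency['Category2'],
                         normalize='index')

print(with_totals)
print(row_shares)
```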

7. Visualizing Data
Creating Boxplots, Scatter Plots, and Histograms:

In [None]:
import matplotlib.pyplot as plt

# Boxplot
plt.boxplot(df['Value'])
plt.title('Boxplot of Values')
plt.show()

# Scatter plot
plt.scatter(df.index, df['Value'])
plt.title('Scatter Plot of Values')
plt.xlabel('Index')
plt.ylabel('Value')
plt.show()

# Histogram
plt.hist(df['Value'], bins=5)
plt.title('Histogram of Values')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()


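Explanation: This code creates three plots of the same column. The boxplot summarizes the median and quartiles and marks potential outliers as individual points, the scatter plot shows each value against its row index, and the histogram shows how the values are distributed across five bins.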

8. Using Covariance and Correlation
Calculating Covariance and Correlation:

In [None]:
# Sample DataFrame with two columns
df_corr = pd.DataFrame({
    'X': [1, 2, 3, 4, 5],
    'Y': [2, 3, 4, 5, 6]
})

# Covariance
covariance = df_corr.cov().iloc[0, 1]
print("Covariance:", covariance)

# Correlation
correlation = df_corr.corr().iloc[0, 1]
print("Correlation:", correlation)


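Explanation: This code calculates the covariance and correlation between two variables. The sign of the covariance indicates the direction of the relationship, but its magnitude depends on the units of the data. Correlation (Pearson's r, the default in corr()) rescales the covariance to the range -1 to 1, so it measures both strength and direction. Here Y is always X + 1, a perfect positive linear relationship, so the correlation is 1.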
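The relationship between the two measures can be made explicit by computing them from their definitions. This is a minimal sketch using only the standard library and the sample (n - 1) divisor, which is also what pandas uses by default. The lists x and y mirror the columns above.

```python
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 3, 4, 5, 6]
n = len(x)

mean_x = statistics.mean(x)
mean_y = statistics.mean(y)

# Sample covariance: average product of paired deviations, divided by n - 1
covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

# Pearson correlation: covariance scaled by both sample standard deviations
correlation = covariance / (statistics.stdev(x) * statistics.stdev(y))

print("Covariance:", covariance)
print("Correlation:", correlation)
```

Dividing by the standard deviations removes the units, which is why correlation can be compared across variable pairs while covariance cannot.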

9. Applying Z-score Standardization
Standardizing Data with Z-scores:

In [None]:
# Z-score standardization
df_standardized = (df['Value'] - df['Value'].mean()) / df['Value'].std()
print("Z-score Standardized Data:", df_standardized)


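Explanation: This code standardizes the data by converting values to Z-scores, indicating how many standard deviations each value is from the mean. Note that pandas' std() uses the sample standard deviation (dividing by n - 1), whereas scipy's stats.zscore() used in section 4 divides by n by default, so the two approaches give slightly different Z-scores.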
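A quick sanity check after standardizing is that the result has mean 0 and standard deviation 1 under the same divisor that was used to compute it. The sketch below, using numpy and illustrative variable names, verifies this and shows how the sample-based and population-based Z-scores differ by a constant factor.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 4, 6, 6, 8, 9, 10], dtype=float)
n = len(values)

# Z-scores using the sample standard deviation (ddof=1), as pandas' std() does
z_sample = (values - values.mean()) / values.std(ddof=1)

# Z-scores using the population standard deviation (ddof=0)
z_population = (values - values.mean()) / values.std(ddof=0)

# The two versions differ only by the constant factor sqrt(n / (n - 1))
scale = np.sqrt(n / (n - 1))

print("Mean of z_sample:", z_sample.mean())
print("Std of z_sample (ddof=1):", z_sample.std(ddof=1))
print("Scale factor:", scale)
```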

10. Detecting Outliers Using IQR
Using IQR to Detect Outliers:

In [None]:
Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = df['Value'][(df['Value'] < (Q1 - 1.5 * IQR)) | (df['Value'] > (Q3 + 1.5 * IQR))]
print("Outliers (IQR method):", outliers_iqr)


Explanation: This code uses the Interquartile Range (IQR) method to detect outliers. Values outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] are considered outliers. For this sample the range is [-5.625, 15.375], so no values are flagged.
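The same logic can be wrapped in a small reusable function. The helper iqr_outliers and the extra value 30 below are hypothetical, added for illustration so that the method has something to flag. This is a sketch assuming a numeric pandas Series as input.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return series[(series < lower) | (series > upper)]

original = pd.Series([1, 2, 2, 3, 4, 6, 6, 8, 9, 10])
with_extreme = pd.Series([1, 2, 2, 3, 4, 6, 6, 8, 9, 10, 30])  # 30 added for illustration

print("Original sample:", iqr_outliers(original).tolist())
print("With extreme value:", iqr_outliers(with_extreme).tolist())
```

The multiplier k is left as a parameter because 1.5 is a convention, not a fixed rule; a larger value flags only more extreme points.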