
# **Exploratory Data Analysis:**

Exploratory Data Analysis (EDA) is the process of examining and visualizing data to understand its structure, patterns, and anomalies. It helps prepare the data for modeling by revealing insights and guiding further analysis.

**Exploratory Data Analysis on Student Dataset:**

In this notebook, we perform a simple exploratory data analysis (EDA) on a sample dataset containing information about students, including their names, ages, and marks. The goal is to understand the basic structure and statistics of the data.

We will cover the following steps:

1-Display the first few rows of the dataset to get an overview.

2-Generate summary statistics for numerical columns.

3-Count the total number of students.

4-Calculate the average age of students.

5-Identify the highest marks obtained.

6-Count how many students are older than a certain age (e.g., greater than 22).

This initial EDA helps us understand the distribution, central tendencies, and patterns within the data before we proceed to more advanced analysis or modeling.

**Explanation:**

* df.head() shows the first few rows of the dataset.

* df.describe() gives summary statistics like mean, std, min, max, etc.

* df['Name'].count() counts the number of students.

* df['Age'].mean() finds the average age of students.

* df['Marks'].max() finds the highest marks obtained by a student.

* df[df['Age'] > 22] filters students older than 22; calling .count() on the result gives the number of non-null values in each column.


In [None]:
import pandas as pd

# Sample DataFrame with students' data
data = {
    'Name': ['Iqra', 'Sara', 'Eman', 'Saba', 'Noor'],
    'Age': [23, 25, 22, 24, 21],
    'Marks': [88, 92, 85, 90, 87]
}

df = pd.DataFrame(data)

# 1. Checking the first few rows of the dataset
print("First 5 rows:")
print(df.head())

# 2. Summary statistics for numerical columns
print("\nSummary statistics:")
print(df.describe())

# 3. Count of students
print("\nNumber of students:")
print(df['Name'].count())

# 4. Average age of students
print("\nAverage age of students:")
print(df['Age'].mean())

# 5. Highest marks obtained
print("\nHighest marks:")
print(df['Marks'].max())

# 6. Count of students above a certain age (e.g., age > 22); .count() reports non-null values per column
print("\nStudents older than 22:")
print(df[df['Age'] > 22].count())


First 5 rows:
   Name  Age  Marks
0  Iqra   23     88
1  Sara   25     92
2  Eman   22     85
3  Saba   24     90
4  Noor   21     87

Summary statistics:
             Age      Marks
count   5.000000   5.000000
mean   23.000000  88.400000
std     1.581139   2.701851
min    21.000000  85.000000
25%    22.000000  87.000000
50%    23.000000  88.000000
75%    24.000000  90.000000
max    25.000000  92.000000

Number of students:
5

Average age of students:
23.0

Highest marks:
92

Students older than 22:
Name     3
Age      3
Marks    3
dtype: int64
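
Calling `.count()` on the filtered DataFrame returns one count per column, as the output above shows. If a single number is wanted, summing the boolean mask is one option. The following is a minimal sketch that reuses the sample data from this section; the names `mask` and `older_than_22` are introduced here for illustration only.

```python
import pandas as pd

# Same sample data as in the cell above
df = pd.DataFrame({
    'Name': ['Iqra', 'Sara', 'Eman', 'Saba', 'Noor'],
    'Age': [23, 25, 22, 24, 21],
    'Marks': [88, 92, 85, 90, 87]
})

# Boolean mask: True where the student is older than 22
mask = df['Age'] > 22

# True counts as 1, so the sum is the number of matching rows
older_than_22 = int(mask.sum())
print("Students older than 22:", older_than_22)

# The matching rows themselves
print(df[mask])
```

`len(df[mask])` would give the same number; the mask sum simply avoids building the filtered DataFrame when only the count is needed.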


# **Handling Missing Values:**

Real-world datasets often contain missing or incomplete data, which can affect the accuracy and performance of data analysis and machine learning models.

**Handling Missing Values in a Dataset**

In this notebook, we demonstrate how to detect and handle missing values using a sample dataset of student information.

We will perform the following steps:

1-Display the first few rows of the dataset to understand its structure.

2-Identify the missing values in each column.

3-Fill the missing values in the 'Age' column using the mean age.

4-Fill the missing values in the 'Marks' column using the median marks.

5-Verify that all missing values have been properly handled.

This basic preprocessing step ensures that the dataset is clean and ready for further analysis or modeling.

**Explanation:**

* df.isnull().sum() shows how many missing values there are in each column.

* df['Age'].fillna(df['Age'].mean()) fills missing values in the "Age" column with the mean of the available ages.

* df['Marks'].fillna(df['Marks'].median()) fills missing values in the "Marks" column with the median of the available marks.

* After filling, we check if there are still missing values with df.isnull().sum().

This code is useful for basic imputation of missing values and ensures that the dataset is complete for analysis.



In [None]:
import pandas as pd
import numpy as np

# Sample DataFrame with missing values
data = {
    'Name': ['Iqra', 'Sara', 'Eman', 'Saba', 'Noor'],
    'Age': [23, np.nan, 22, 24, np.nan],
    'Marks': [88, 92, np.nan, 90, 87]
}

df = pd.DataFrame(data)

# 1. Checking the first few rows of the dataset
print("First 5 rows:")
print(df.head())

# 2. Checking for missing values
print("\nMissing values in each column:")
print(df.isnull().sum())

# 3. Count students with missing marks before any filling takes place
missing_marks_count = df['Marks'].isnull().sum()

# 4. Fill missing values in 'Age' with the mean age
df['Age'] = df['Age'].fillna(df['Age'].mean())

# 5. Fill missing values in 'Marks' with the median marks
df['Marks'] = df['Marks'].fillna(df['Marks'].median())

# 6. Checking if there are any missing values after filling
print("\nMissing values after filling:")
print(df.isnull().sum())

# 7. Report the count captured in step 3
print(f"\nNumber of students with missing marks before filling: {missing_marks_count}")


First 5 rows:
   Name   Age  Marks
0  Iqra  23.0   88.0
1  Sara   NaN   92.0
2  Eman  22.0    NaN
3  Saba  24.0   90.0
4  Noor   NaN   87.0

Missing values in each column:
Name     0
Age      2
Marks    1
dtype: int64

Missing values after filling:
Name     0
Age      0
Marks    0
dtype: int64

Number of students with missing marks before filling: 1
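
It can help to look at the fill values themselves before applying them. The sketch below reuses the sample data from this section, computes the mean age and the median marks from the non-missing entries (pandas skips NaN by default), and fills both columns in a single `fillna` call. The names `fill_values` and `df_filled` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same sample data as in the cell above
df = pd.DataFrame({
    'Name': ['Iqra', 'Sara', 'Eman', 'Saba', 'Noor'],
    'Age': [23, np.nan, 22, 24, np.nan],
    'Marks': [88, 92, np.nan, 90, 87]
})

# Per-column fill values, computed from the available (non-NaN) entries
fill_values = {
    'Age': float(df['Age'].mean()),        # mean of 23, 22, 24
    'Marks': float(df['Marks'].median()),  # median of 88, 92, 90, 87
}
print("Fill values:", fill_values)

# A dict passed to fillna applies each value to its own column;
# the original df is left unchanged
df_filled = df.fillna(value=fill_values)
print(df_filled)
```

Keeping the result in a separate DataFrame leaves the original data available for comparison, which is convenient while experimenting with different imputation choices.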


# **Handling Outliers:**

Outliers are data points that significantly differ from other observations in a dataset. They can distort statistical analyses and negatively impact the performance of machine learning models. Detecting and removing outliers is an important part of data cleaning and preprocessing.

**Handling Outliers in a Dataset:**

In this notebook, we use a sample dataset containing student information with some extreme values in the 'Age' and 'Marks' columns.

We will perform the following steps:

1-Identify outliers in the 'Age' column using the Interquartile Range (IQR) method.

2-Remove the detected outliers from the dataset.

3-Repeat the same process to detect and remove outliers in the 'Marks' column.

4-Display the cleaned dataset after outlier removal.

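The IQR method treats a value as an outlier when it lies more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3), where IQR = Q3 - Q1. This retains most of the data while filtering out values that fall far outside the typical range.

**Explanation:**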

* Detects outliers in "Age" and "Marks" using the IQR (Interquartile Range) method.

* Removes rows with outlier values.

* Prints the cleaned dataset.



In [None]:
import pandas as pd

# Sample DataFrame with some outliers
data = {
    'Name': ['Iqra', 'Sara', 'Eman', 'Saba', 'Noor', 'Hajra'],
    'Age': [23, 25, 100, 24, 22, 21],   # 100 is an outlier
    'Marks': [88, 92, 85, 300, 87, 89]  # 300 is an outlier
}

df = pd.DataFrame(data)

# 1. Detecting outliers in 'Age' using IQR method
Q1_age = df['Age'].quantile(0.25)
Q3_age = df['Age'].quantile(0.75)
IQR_age = Q3_age - Q1_age

# 2. Filtering out the outliers in Age
df = df[(df['Age'] >= Q1_age - 1.5 * IQR_age) & (df['Age'] <= Q3_age + 1.5 * IQR_age)]

# 3. Detecting and removing outliers in 'Marks'
Q1_marks = df['Marks'].quantile(0.25)
Q3_marks = df['Marks'].quantile(0.75)
IQR_marks = Q3_marks - Q1_marks

df = df[(df['Marks'] >= Q1_marks - 1.5 * IQR_marks) & (df['Marks'] <= Q3_marks + 1.5 * IQR_marks)]

# 4. Final clean data after removing outliers
print("Data after removing outliers:")
print(df)


Data after removing outliers:
    Name  Age  Marks
0   Iqra   23     88
1   Sara   25     92
4   Noor   22     87
5  Hajra   21     89
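
The bounds used by the filter can be made explicit. The following is a minimal sketch of the 1.5 × IQR rule written as a helper; `iqr_bounds` and `is_outlier` are hypothetical names introduced here, and the factor 1.5 is the same one used in the cell above. Unlike that cell, this version flags outliers on the full dataset, so the 'Marks' quartiles are computed before any rows are dropped for 'Age'.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Same sample data as in the cell above
df = pd.DataFrame({
    'Name': ['Iqra', 'Sara', 'Eman', 'Saba', 'Noor', 'Hajra'],
    'Age': [23, 25, 100, 24, 22, 21],
    'Marks': [88, 92, 85, 300, 87, 89]
})

age_low, age_high = iqr_bounds(df['Age'])
marks_low, marks_high = iqr_bounds(df['Marks'])
print(f"Age bounds:   {age_low} to {age_high}")
print(f"Marks bounds: {marks_low} to {marks_high}")

# A row is flagged if either column falls outside its bounds
is_outlier = (~df['Age'].between(age_low, age_high)
              | ~df['Marks'].between(marks_low, marks_high))

print("\nFlagged rows:")
print(df[is_outlier])

df_clean = df[~is_outlier]
print("\nRemaining rows:")
print(df_clean)
```

Flagging first and dropping afterwards makes it possible to inspect the suspicious rows before deciding whether they are errors or genuine extreme values.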


# **Categorical Encoding:**

Categorical data refers to variables that contain label values rather than numeric values. Machine learning models usually require all input features to be numeric, so converting categorical data into a numerical format is an essential preprocessing step.

**Categorical Encoding in a Dataset:**

In this notebook, we work with a sample dataset that includes a categorical column: 'Gender'. We demonstrate two popular encoding techniques:

**Label Encoding:** Converts categories into numeric labels (e.g., Male = 1, Female = 0). This method is simple but may introduce unintended ordinal relationships.

**One-Hot Encoding:** Creates separate binary columns for each category, avoiding any implied order between them. This is the preferred method for nominal (unordered) categorical data.

These encoding methods prepare categorical variables for use in machine learning algorithms.

**Explanation:**

* map() is used for simple label encoding.

* pd.get_dummies() is used for one-hot encoding, which turns the "Gender" column into "Gender_Female" and "Gender_Male".

* You can apply the same methods to other categorical columns such as "Grade" or "Subject".



In [None]:
import pandas as pd

# Sample data with a categorical column
data = {
    'Name': ['Iqra', 'Sara', 'Eman', 'Saba', 'Noor'],
    'Gender': ['Female', 'Male', 'Male', 'Male', 'Female'],  # Categorical
    'Age': [23, 25, 22, 24, 21],
    'Marks': [88, 92, 85, 90, 87]
}

df = pd.DataFrame(data)

# 1. Label Encoding for 'Gender' (Male = 1, Female = 0)
df['Gender_Label'] = df['Gender'].map({'Male': 1, 'Female': 0})

# 2. One-Hot Encoding for 'Gender'
df_onehot = pd.get_dummies(df, columns=['Gender'])

# 3. Show original with label and one-hot encoding
print("Data with Label Encoding:")
print(df[['Name', 'Gender', 'Gender_Label']])

print("\nData with One-Hot Encoding:")
print(df_onehot)

Data with Label Encoding:
   Name  Gender  Gender_Label
0  Iqra  Female             0
1  Sara    Male             1
2  Eman    Male             1
3  Saba    Male             1
4  Noor  Female             0

Data with One-Hot Encoding:
   Name  Age  Marks  Gender_Label  Gender_Female  Gender_Male
0  Iqra   23     88             0           True        False
1  Sara   25     92             1          False         True
2  Eman   22     85             1          False         True
3  Saba   24     90             1          False         True
4  Noor   21     87             0           True        False
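
The same two techniques can be applied to other categorical columns. The sketch below assumes a hypothetical 'Grade' column, which has a natural order and therefore suits label encoding, and a hypothetical 'Subject' column, which has no order and therefore suits one-hot encoding. The column values and the ordering in `grade_order` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: 'Grade' is ordered, 'Subject' is nominal
df = pd.DataFrame({
    'Name': ['Iqra', 'Sara', 'Eman'],
    'Grade': ['B', 'A', 'C'],
    'Subject': ['Math', 'Physics', 'Math'],
})

# Label encoding with an explicit order (assumed here: C < B < A)
grade_order = {'C': 0, 'B': 1, 'A': 2}
df['Grade_Label'] = df['Grade'].map(grade_order)

# One-hot encoding for the unordered column; dtype=int yields 0/1
# instead of the True/False values shown in the output above
df_encoded = pd.get_dummies(df, columns=['Subject'], dtype=int)
print(df_encoded)
```

Note that `map()` produces NaN for any category missing from the dictionary, so it is worth checking the result for missing values when the mapping is written by hand.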


# **Normalization and Standardization:**

When working with numerical data, it's important to scale the values so that they are on a similar range. This is especially critical for many machine learning algorithms that are sensitive to the scale of input features.

**Normalization and Standardization of Data:**

In this notebook, we demonstrate two common data scaling techniques:

**Normalization:** Also known as Min-Max Scaling, this technique rescales the data to a fixed range, usually 0 to 1. It is useful when the data needs to be bounded and evenly scaled.

**Standardization:** This technique transforms the data to have a mean of 0 and a standard deviation of 1. It is useful when features have different units or ranges and the algorithm is sensitive to feature scale (e.g., regularized logistic regression, SVM). It does not require the data to follow a normal distribution, and unlike normalization it does not bound the values to a fixed range.

We apply both methods to the 'Age' and 'Marks' columns in a sample dataset to compare the results. Scaling puts features on comparable ranges so that no feature dominates simply because of its units, and it often improves convergence speed in training.

**Explanation:**

* MinMaxScaler transforms data between 0 and 1 (Normalization).

* StandardScaler transforms data to have mean 0 and standard deviation 1 (Standardization).

We apply both to the numeric columns 'Age' and 'Marks'.


In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample data
data = {
    'Name': ['Iqra', 'Sara', 'Eman', 'Saba', 'Noor'],
    'Age': [23, 25, 22, 24, 21],
    'Marks': [88, 92, 85, 90, 87]
}

df = pd.DataFrame(data)

# 1. Normalization (scales data between 0 and 1)
scaler_norm = MinMaxScaler()
df[['Age_Norm', 'Marks_Norm']] = scaler_norm.fit_transform(df[['Age', 'Marks']])

# 2. Standardization (mean = 0, std = 1)
scaler_std = StandardScaler()
df[['Age_Std', 'Marks_Std']] = scaler_std.fit_transform(df[['Age', 'Marks']])

# 3. Show the result
print("Data with Normalization and Standardization:")
print(df)

Data with Normalization and Standardization:
   Name  Age  Marks  Age_Norm  Marks_Norm   Age_Std  Marks_Std
0  Iqra   23     88      0.50    0.428571  0.000000  -0.165521
1  Sara   25     92      1.00    1.000000  1.414214   1.489691
2  Eman   22     85      0.25    0.000000 -0.707107  -1.406930
3  Saba   24     90      0.75    0.714286  0.707107   0.662085
4  Noor   21     87      0.00    0.285714 -1.414214  -0.579324
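
Both transforms are simple column-wise formulas: min-max scaling computes (x - min) / (max - min), and standardization computes (x - mean) / std. The sketch below reproduces them with plain pandas as a cross-check on the scaler output above. One detail matters: `StandardScaler` divides by the population standard deviation, whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so `ddof=0` is passed here to match. The names `norm` and `std` are illustrative.

```python
import pandas as pd

# Same numeric columns as in the cell above
df = pd.DataFrame({
    'Age': [23, 25, 22, 24, 21],
    'Marks': [88, 92, 85, 90, 87]
})
cols = ['Age', 'Marks']

# Min-max normalization: (x - min) / (max - min), per column
norm = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())

# Standardization: (x - mean) / population std (ddof=0), per column
std = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)

print("Normalized:")
print(norm)
print("\nStandardized:")
print(std)
```

With the default `ddof=1` the standardized values would be slightly smaller in magnitude than the scaler output, a difference that is noticeable on a five-row dataset but shrinks as the number of rows grows.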
