# Handling Missing Values
### Problem Definition:
The dataset contains missing values (`NaN`) in some columns. Missing values can bias the analysis, and many machine-learning estimators do not accept them, so they must be handled before modelling.

In [1]:
import pandas as pd
import numpy as np

# Example dataset
data = {'Column1': [1, 2, np.nan, 4],
        'Column2': [np.nan, 2, 3, 4]}
df = pd.DataFrame(data)

# Show the data before imputation
print('Before handling missing values:\n', df)

# Impute missing values with each column's mean (NaN is skipped when computing the mean)
df = df.fillna(df.mean())

print('After handling missing values:\n', df)

Before handling missing values:
    Column1  Column2
0      1.0      NaN
1      2.0      2.0
2      NaN      3.0
3      4.0      4.0
After handling missing values:
     Column1  Column2
0  1.000000      3.0
1  2.000000      2.0
2  2.333333      3.0
3  4.000000      4.0


### Justification:
We imputed each missing value with the mean of its column. This keeps all four rows and leaves each column's mean unchanged, but it reduces the column's variance, so it is best suited to numeric columns with few missing values and no strong skew.
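
The trade-off described above can be checked directly. The following is a minimal sketch on the same toy data; the names `df_raw`, `df_imputed`, and `summary` are introduced here for illustration only. It counts the missing values per column and compares the mean and standard deviation before and after mean imputation.

```python
import numpy as np
import pandas as pd

# Same toy data as in the cell above
df_raw = pd.DataFrame({'Column1': [1, 2, np.nan, 4],
                       'Column2': [np.nan, 2, 3, 4]})

# Count missing values per column before imputing
missing_counts = df_raw.isna().sum()

# Mean imputation; fillna returns a new DataFrame, so df_raw is unchanged
df_imputed = df_raw.fillna(df_raw.mean())

# pandas skips NaN when computing mean and std, so the "before" columns
# describe only the observed values
summary = pd.DataFrame({
    'missing': missing_counts,
    'mean_before': df_raw.mean(),
    'mean_after': df_imputed.mean(),
    'std_before': df_raw.std(),
    'std_after': df_imputed.std(),
})
print(summary)
```

The means should match before and after, while the standard deviation shrinks, because every imputed value sits exactly at the column mean. If that shrinkage matters for the analysis, the median or a model-based imputer may be worth comparing.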

# Classification Example
### Problem Definition:
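We want to assign each data point to one of several categories based on its features. Here, the task is to predict the species of an iris flower (three classes) from four measurements: sepal length, sepal width, petal length, and petal width.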

In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Classification Accuracy: {accuracy:.2f}')

Classification Accuracy: 1.00
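
A single accuracy figure does not show which classes are confused with each other. As a possible complement, the sketch below repeats the same split and model and prints a confusion matrix; the variable name `cm` is an assumption made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Same data, split, and model settings as in the cell above
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(iris.target_names)
print(cm)
```

Counts on the diagonal are correct predictions; any off-diagonal entry is a test sample assigned to the wrong species.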


### Justification:
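We used a Random Forest Classifier because it is less prone to overfitting than a single decision tree and can model both linear and non-linear relationships between features and classes. We set `n_estimators=100` as a good balance between performance and computational cost. Note that the test set contains only 30 samples, so the accuracy of 1.00 is an estimate from one small split.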
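
Because one train/test split on 150 samples can give an optimistic or pessimistic figure by chance, a cross-validated estimate is a useful sanity check. The sketch below assumes 5 folds, a choice made for this example rather than something the exercise prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Same model settings as above
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation: each sample is used for testing exactly once
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')

print('Fold accuracies:', scores.round(2))
print(f'Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})')
```

Reporting the mean together with the spread across folds gives a more honest picture of performance than a single number from one split.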

### Submission Instructions:
Once you complete your notebook:
- Save it as a PDF (File > Print > Save as PDF).
- Upload the PDF to Moodle within the given time frame.