### Understanding the Problem and the Data

The first step in any data analysis project is to fully understand the problem you are solving and the data you have. This includes asking key questions like:

1) What is the business goal or research question?
   
2) What are the variables in the data and what do they represent?
  
3) What types of data (numerical, categorical, text, etc.) do you have?

4) Are there any known data quality issues or limitations?

5) Are there any domain-specific concerns or restrictions?
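
Questions 2 to 4 can usually be answered with a quick first look at the DataFrame. The cell below is a minimal sketch of that inspection; the sample data and the column names `Name`, `Marks`, and `Grade` are hypothetical stand-ins for your own dataset.

```python
import pandas as pd

# Hypothetical sample data standing in for the real dataset
df = pd.DataFrame({
    "Name": ["Asha", "Ben", "Cara", "Dev"],
    "Marks": [78.0, None, 91.0, 64.0],
    "Grade": ["B", None, "A", "C"],
})

print(df.shape)                     # number of rows and columns
print(df.dtypes)                    # data type of each column
print(df.head())                    # first few rows
print(df.describe(include="all"))   # summary statistics for every column

# Count of missing values per column, a first hint at data quality issues
missing_per_column = df.isna().sum()
print(missing_per_column)
```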


## Missing Values 

### 1) Drop the rows 
- Use this when the rows with missing values are few and dropping them won't significantly reduce your dataset.

In [None]:
df.dropna(inplace = True)

### 2) Fill in the missing values (Imputation)
- Use this when you can't afford to lose rows.

In [None]:
df.fillna(df.mean(numeric_only = True), inplace = True) # numerical columns: fill with each column's mean
df.fillna(df.mode().iloc[0], inplace = True) # categorical columns: fill with each column's most frequent value

### Observations:
- X missing values were found in [column name]. Since the missing values represent a small percentage of the dataset, they were dropped/filled with the mean to preserve data integrity.

- As a rule of thumb, if missing values make up less than 5% of your data, dropping them is usually safe. If they make up more than 5%, impute (fill them in) rather than lose too much data.
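
To apply the 5% rule of thumb, you first need the percentage of missing values in each column. The cell below is a rough sketch of that check; the sample data, the `THRESHOLD` constant, and the choice to treat exactly 5% as "impute" are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: 25 rows, 1 missing mark (4%) and 5 missing grades (20%)
marks = [float(m) for m in range(50, 75)]
marks[3] = None
df = pd.DataFrame({
    "Marks": marks,
    "Grade": ["A", "B", "C", "D", None] * 5,
})

THRESHOLD = 5.0  # percent; the rule-of-thumb cut-off, not a fixed standard

# isna() gives True/False per cell; the mean of booleans is the missing fraction
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Columns below the threshold are candidates for dropping rows,
# columns at or above it are candidates for imputation
drop_candidates = missing_pct[(missing_pct > 0) & (missing_pct < THRESHOLD)].index.tolist()
impute_candidates = missing_pct[missing_pct >= THRESHOLD].index.tolist()
print("Drop rows for:", drop_candidates)
print("Impute:", impute_candidates)
```

The threshold is only a starting point: why the values are missing also matters when choosing between dropping and imputing.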

### 3) Forward and backward fill

In [None]:
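forward_fill = df["Marks"].ffill() # fillna(method = "ffill") is deprecated in recent pandas versions
backward_fill = df["Marks"].bfill() # fillna(method = "bfill") is deprecated in recent pandas versions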

- This method is useful for ordered or time series data.
- Forward fill uses the last valid observation to fill missing values.
- Backward fill uses the next valid observation to fill missing values.

Pros: 
Preserves order and patterns in the data.

Cons:
May be inaccurate when gaps are large or neighbouring values differ significantly.
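
A small worked example makes the difference between the two directions easier to see. This is an illustrative sketch on a made-up `Marks` series; the `limit` argument is shown as one possible way to stop a value from being carried across a long gap.

```python
import pandas as pd

# Hypothetical ordered series with two gaps
marks = pd.Series([70.0, None, None, 85.0, None, 90.0], name="Marks")

forward_fill = marks.ffill()          # carry the last valid value forward
backward_fill = marks.bfill()         # pull the next valid value backward
limited_fill = marks.ffill(limit=1)   # fill at most one consecutive gap

print(pd.DataFrame({
    "original": marks,
    "ffill": forward_fill,
    "bfill": backward_fill,
    "ffill_limit_1": limited_fill,
}))
```

With `limit=1`, the second value in the two-row gap stays missing, which keeps a stale value from being stretched over a long run.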

### Types of Missing Values
MCAR (Missing Completely at Random): Missingness occurs randomly and is not related to any variable in the dataset.

MAR (Missing at Random): Missingness depends on other observed variables, not on the missing value itself.

MNAR (Missing Not at Random): Missingness is directly related to the value that is missing (e.g., high-income individuals not reporting income).
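
One rough way to probe the type of missingness is to compare the missing rate across an observed variable. The cell below is a sketch on hypothetical data (the `Gender` and `Income` columns are invented). A clear difference between groups suggests the data are not MCAR and is consistent with MAR. It cannot rule out MNAR, because that depends on the values you never observed.

```python
import pandas as pd

# Hypothetical survey data where income is missing mostly for one group
df = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Income": [52.0, 48.0, 61.0, 55.0, None, None, 58.0, None],
})

# Flag rows where income is missing
df["income_missing"] = df["Income"].isna()

# Share of missing income within each group of the observed variable
missing_rate_by_group = df.groupby("Gender")["income_missing"].mean()
print(missing_rate_by_group)
```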

------------------------------------------------------------------------------------

## Duplicates

In [None]:
df.drop_duplicates(inplace = True)

### Observations:
- X duplicate rows were found and removed, reducing the dataset from 1000 to [new number] rows. This ensures each record represents a unique student and prevents results from being skewed.
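
To fill in the numbers for an observation like this, count the duplicates before removing them. The cell below is a minimal sketch; the sample data and the `StudentID` column are hypothetical.

```python
import pandas as pd

# Hypothetical data: one fully repeated row, and one student ID with two different marks
df = pd.DataFrame({
    "StudentID": [1, 2, 2, 3, 3],
    "Marks": [78, 85, 85, 91, 64],
})

rows_before = len(df)
n_duplicates = int(df.duplicated().sum())   # rows identical to an earlier row in every column

df_unique = df.drop_duplicates()            # keeps the first occurrence of each row
rows_after = len(df_unique)

# Repeats of the key column only, which may be conflicting records rather than true duplicates
n_duplicate_ids = int(df.duplicated(subset=["StudentID"]).sum())

print(f"{n_duplicates} duplicate rows removed: {rows_before} -> {rows_after} rows")
print(f"{n_duplicate_ids} rows repeat an existing StudentID")
```

Checking the key column separately is worthwhile because `drop_duplicates()` with no arguments only removes rows that match in every column.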

------------------------------------------------------------------------------------

### Important:
`inplace = True` applies the change directly to the original DataFrame, so the result does not need to be reassigned.
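
The difference is easiest to see side by side. The following sketch uses a made-up three-row DataFrame to compare reassignment with `inplace = True`.

```python
import pandas as pd

df = pd.DataFrame({"Marks": [78.0, None, 91.0]})

# Without inplace: a new DataFrame is returned and df itself is unchanged
cleaned = df.dropna()
print(len(df), len(cleaned))

# With inplace: the DataFrame is modified directly and the call returns None
result = df.copy()
returned = result.dropna(inplace=True)
print(len(result), returned)
```

Because the in-place call returns `None`, writing `df = df.dropna(inplace = True)` would replace the DataFrame with `None`.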