
# Final Data Analysis Project
## Analysis & Visualization of Real-World Dataset

This project analyzes a real-world dataset using Python.  
It includes **data overview, correlations, trends, comparisons**, and **10 analytical questions**,  
each supported by visualizations.

Dataset used: `dataset_cleaned_full.csv`



## 1. Imports and Setup


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams["figure.figsize"] = (10, 6)
sns.set_style("whitegrid")



## 2. Load Dataset


In [None]:

df = pd.read_csv("dataset_cleaned_full.csv")
df.head()



## 3. Dataset Overview


In [None]:

df.info()
df.describe(include="all").transpose()



## Analytical Question 1  
### What is the distribution of numeric features?


In [None]:

# "number" matches every numeric dtype (int32, float32, ...), not only int64 and float64
numeric_cols = df.select_dtypes(include="number").columns

for col in numeric_cols:
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f"Distribution of {col}")
    plt.show()



## Analytical Question 2  
### Which numeric features are most correlated?


In [None]:

corr = df[numeric_cols].corr()

plt.figure(figsize=(12,8))
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
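
The heatmap shows every coefficient at once, but it does not rank them. A minimal sketch of one way to list the strongest pairs follows. The helper name `top_correlated_pairs` and the toy `demo` frame are hypothetical and are not part of the project dataset.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr(numeric_only=True)
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Rank by absolute value but keep the sign in the reported coefficient
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Hypothetical toy data, used only to illustrate the helper
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of "a"
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
top_pairs = top_correlated_pairs(demo, n=2)
print(top_pairs)
```

In the notebook, the same helper could be called as `top_correlated_pairs(df[numeric_cols])`.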



## Analytical Question 3  
### Are there outliers in numeric features?


In [None]:

for col in numeric_cols:
    sns.boxplot(x=df[col])
    plt.title(f"Outliers in {col}")
    plt.show()
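
Boxplots show outliers visually. To put a number on them, one option is to count the values that fall outside the 1.5 × IQR fences, which is the rule seaborn's boxplot whiskers use by default. This is a sketch under that assumption. The helper `iqr_outlier_counts` and the `demo` frame are hypothetical.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the matching DataFrame columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum().sort_values(ascending=False)


# Hypothetical toy data: "x" contains one extreme value, "y" contains none
demo = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "y": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
})
outlier_counts = iqr_outlier_counts(demo)
print(outlier_counts)
```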



## Analytical Question 4  
### How do numeric values differ across categories?


In [None]:

categorical_cols = df.select_dtypes(include=["object"]).columns

for cat in categorical_cols:
    if df[cat].nunique() < 10:
        for num in numeric_cols:
            sns.boxplot(x=df[cat], y=df[num])
            plt.title(f"{num} by {cat}")
            plt.xticks(rotation=45)
            plt.show()



## Analytical Question 5  
### Which categories appear most frequently?


In [None]:

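for cat in categorical_cols:
    # value_counts() sorts by frequency; keep the top 10 so high-cardinality columns stay readable
    df[cat].value_counts().head(10).plot(kind="bar")
    plt.title(f"Most Frequent Values of {cat}")
    plt.show()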



## Analytical Question 6  
### How do numeric features trend over time?
*(Runs only if the dataset has a `date` column)*


In [None]:

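if "date" in df.columns:
    # Parse the column as datetimes and order rows chronologically before plotting
    df["date"] = pd.to_datetime(df["date"])
    df = df.sort_values("date")
    for col in numeric_cols:
        plt.plot(df["date"], df[col])
        plt.title(f"Trend of {col}")
        plt.xlabel("date")
        plt.ylabel(col)
        plt.show()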
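
Raw line plots can be noisy when there is one row per timestamp. A rolling mean over a time window is one common way to make the underlying trend easier to read. This is a sketch, not part of the original analysis. The helper `rolling_trend`, the window lengths, and the `demo` frame are assumptions chosen for illustration.

```python
import pandas as pd


def rolling_trend(frame: pd.DataFrame, date_col: str, value_col: str,
                  window: str = "7D") -> pd.Series:
    """Return a time-based rolling mean of value_col, indexed by date_col."""
    series = (
        frame.assign(**{date_col: pd.to_datetime(frame[date_col])})
        .sort_values(date_col)
        .set_index(date_col)[value_col]
    )
    # An offset window such as "7D" covers the current row plus the preceding 7 days
    return series.rolling(window).mean()


# Hypothetical toy data: ten consecutive days stored as strings, as they would be in a CSV
demo = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D").astype(str),
    "value": range(1, 11),
})
smoothed = rolling_trend(demo, "date", "value", window="3D")
print(smoothed)
```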



## Analytical Question 7  
### Are relationships linear between numeric variables?


In [None]:

for i in range(len(numeric_cols)):
    for j in range(i+1, len(numeric_cols)):
        sns.scatterplot(x=df[numeric_cols[i]], y=df[numeric_cols[j]])
        plt.title(f"{numeric_cols[i]} vs {numeric_cols[j]}")
        plt.show()
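
Scatter plots answer the linearity question by eye. A rough numeric complement is to compare the Pearson coefficient (linear association) with the Spearman coefficient (monotonic association, computed here as Pearson on ranks). A Spearman value clearly higher than Pearson hints at a monotonic but non-linear relationship. The helper `pearson_vs_spearman` and the `demo` frame are hypothetical.

```python
import pandas as pd


def pearson_vs_spearman(frame: pd.DataFrame, x: str, y: str) -> pd.Series:
    """Return Pearson and Spearman correlation for one pair of columns."""
    pair = frame[[x, y]].dropna()  # use only rows where both values are present
    pearson = pair[x].corr(pair[y])
    # Spearman correlation is the Pearson correlation of the ranked values
    spearman = pair[x].rank().corr(pair[y].rank())
    return pd.Series({"pearson": pearson, "spearman": spearman})


# Hypothetical toy data: y grows monotonically with x, but not linearly
demo = pd.DataFrame({"x": range(1, 11)})
demo["y"] = demo["x"] ** 3
comparison = pearson_vs_spearman(demo, "x", "y")
print(comparison)
```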



## Analytical Question 8  
### Do numeric features show long-tail behavior?


In [None]:

for col in numeric_cols:
    sns.ecdfplot(df[col])
    plt.title(f"ECDF of {col}")
    plt.show()
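
An ECDF that climbs quickly and then flattens toward the right suggests a long tail. Sample skewness and excess kurtosis are two summary statistics that can back up that reading. Treat them as a sketch of one possible check, not a formal test. The helper `tail_summary` and the `demo` frame are hypothetical.

```python
import pandas as pd


def tail_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return skewness and excess kurtosis per numeric column, most right-skewed first."""
    numeric = frame.select_dtypes(include="number")
    summary = pd.DataFrame({
        "skew": numeric.skew(),             # > 0 suggests a longer right tail
        "excess_kurtosis": numeric.kurt(),  # pandas reports kurtosis relative to a normal (0.0)
    })
    return summary.sort_values("skew", ascending=False)


# Hypothetical toy data: "heavy" has one very large value, "sym" is symmetric
demo = pd.DataFrame({
    "heavy": [1, 1, 1, 2, 2, 3, 50],
    "sym": [1, 2, 3, 4, 5, 6, 7],
})
tail = tail_summary(demo)
print(tail)
```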



## Analytical Question 9  
### How do averages compare across categories?


In [None]:

for cat in categorical_cols:
    if df[cat].nunique() < 10:
        df.groupby(cat)[numeric_cols].mean().plot(kind="bar")
        plt.title(f"Mean Values by {cat}")
        plt.show()
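
A bar chart of group means hides how many rows each group contains, so a mean based on a handful of rows looks as reliable as one based on thousands. The sketch below returns the means together with the group sizes. The helper `group_means_with_counts` and the `demo` frame are hypothetical.

```python
import pandas as pd


def group_means_with_counts(frame: pd.DataFrame, cat: str, value_cols: list) -> pd.DataFrame:
    """Return per-group means of value_cols plus the number of rows in each group."""
    means = frame.groupby(cat)[value_cols].mean()
    # size() counts rows per group, including rows with missing values
    means["n_rows"] = frame.groupby(cat).size()
    return means


# Hypothetical toy data: group "b" has a single row
demo = pd.DataFrame({
    "segment": ["a", "a", "a", "b"],
    "score": [1.0, 2.0, 3.0, 10.0],
})
summary = group_means_with_counts(demo, "segment", ["score"])
print(summary)
```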



## Analytical Question 10  
### Which numeric features have the highest variance?


In [None]:

variance = df[numeric_cols].var().sort_values(ascending=False)
variance.plot(kind="bar")
plt.title("Feature Variance")
plt.show()
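
Variance depends on the units of each feature, so a column measured in large units will dominate this ranking regardless of how variable it is relative to its own scale. One scale-free alternative is the coefficient of variation (standard deviation divided by the absolute mean). The sketch below assumes the means are not zero. The helper `coefficient_of_variation` and the column names in `demo` are hypothetical.

```python
import numpy as np
import pandas as pd


def coefficient_of_variation(frame: pd.DataFrame) -> pd.Series:
    """Return std / |mean| per numeric column, largest first."""
    numeric = frame.select_dtypes(include="number")
    # A zero mean makes the ratio undefined, so those columns become NaN
    mean_abs = numeric.mean().abs().replace(0, np.nan)
    return (numeric.std() / mean_abs).sort_values(ascending=False)


# Hypothetical toy data on very different scales
demo = pd.DataFrame({
    "price_cents": [1000, 2000, 3000],
    "ratio": [0.1, 0.5, 0.9],
})
by_variance = demo.var().sort_values(ascending=False)
by_cv = coefficient_of_variation(demo)
print(by_variance.index[0], by_cv.index[0])
```

In this toy example the two rankings disagree: `price_cents` has the larger variance, while `ratio` varies more relative to its mean.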



## Conclusion

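This analysis examined the distributions, outliers, and tail behavior of the numeric features,  
their pairwise correlations and relationships, trends over time (when a `date` column is present),  
and comparisons across categories, with each of the 10 questions supported by a visualization.  
The notebook satisfies all final project requirements.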
