# Weather Classification with Decision Trees

This notebook builds a supervised classification pipeline that predicts the weather condition from meteorological features. It covers data loading, exploratory analysis, preprocessing, training Decision Trees with varying `max_depth`, and an analysis of overfitting versus generalization.

**Data source note:** `weather.csv` was not found in the working directory, so a synthetic dataset was created for demonstration.


In [None]:
# Load the dataset if the file is present; otherwise leave df as None
import os
import pandas as pd

df = pd.read_csv('weather.csv') if os.path.exists('weather.csv') else None
print('Loaded' if df is not None else 'df is None -> dataset file not found in working directory')

## Data Loading & Overview

- Show top records and describe the structure.

In [None]:
# df was loaded in the previous cell; show its first rows and structure

if df is None:
    print('No weather.csv found in working directory. This notebook used a synthetic dataset created programmatically.')
else:
    display(df.head())
    print('\nInfo:')
    df.info()
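
The data source note says a synthetic dataset was used, but the cells here do not show how it was generated. The following is a minimal sketch of one way to build such a fallback, not the original generator. Only the `Weather` target name comes from this notebook; the function name `make_synthetic_weather`, the feature names (`Temperature`, `Humidity`, `WindSpeed`, `Season`), the class labels, and the labeling rule are all hypothetical placeholders.

```python
import numpy as np
import pandas as pd

def make_synthetic_weather(n_rows=500, seed=42):
    """Hypothetical stand-in for weather.csv; every column except 'Weather' is an assumption."""
    rng = np.random.default_rng(seed)
    temperature = rng.normal(loc=18.0, scale=8.0, size=n_rows)  # assumed degrees C
    humidity = rng.uniform(20.0, 100.0, size=n_rows)            # assumed percent
    wind_speed = rng.gamma(shape=2.0, scale=5.0, size=n_rows)   # assumed km/h
    season = rng.choice(['Winter', 'Spring', 'Summer', 'Autumn'], size=n_rows)

    # Arbitrary rule so that the target depends on the features
    weather = np.where(humidity > 75, 'Rainy',
                       np.where(temperature > 22, 'Sunny', 'Cloudy'))

    # Randomly reassign about 10% of labels so the classes are not perfectly separable
    noisy = rng.random(n_rows) < 0.1
    weather[noisy] = rng.choice(['Rainy', 'Sunny', 'Cloudy'], size=noisy.sum())

    return pd.DataFrame({
        'Temperature': temperature,
        'Humidity': humidity,
        'WindSpeed': wind_speed,
        'Season': season,
        'Weather': weather,
    })

synthetic_df = make_synthetic_weather()
print(synthetic_df.head())
```

If `df` is `None` after the load step, assigning `df = make_synthetic_weather()` would let the remaining cells run on this placeholder data.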

## Exploratory Analysis

- Distribution of the target classes and simple quality checks.

In [None]:
display(df.head())
print('\nValue counts for target:')
print(df['Weather'].value_counts())

print('\nMissing values per column:')
print(df.isnull().sum())

print('\nNumber of duplicate rows:')
print(df.duplicated().sum())
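
Raw class counts are often easier to judge as proportions, and a duplicate count is only useful if the duplicates are then handled. The sketch below shows both steps on a toy frame; the rows are made up for illustration and `Humidity` is a hypothetical feature name.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the weather data)
toy = pd.DataFrame({
    'Humidity': [80, 80, 35, 40, 90, 35],
    'Weather': ['Rainy', 'Rainy', 'Sunny', 'Sunny', 'Rainy', 'Sunny'],
})

# Class proportions instead of raw counts
class_share = toy['Weather'].value_counts(normalize=True)
print(class_share)

# Drop exact duplicate rows, keeping the first occurrence of each
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), 'rows before,', len(deduped), 'rows after dropping duplicates')
```

Whether duplicates should be dropped depends on the data: repeated identical readings can be legitimate observations, so this is a choice to make deliberately.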

## Preprocessing

- Separate features and target, make a stratified 75/25 train-test split, and identify the numeric and categorical columns that the preprocessing pipelines will handle.

In [None]:
import numpy as np  # needed for np.number in select_dtypes below
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

X = df.drop(columns=['Weather'])
y = df['Weather']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

cat_cols = X.select_dtypes(include=['object','category']).columns.tolist()
num_cols = X.select_dtypes(include=[np.number]).columns.tolist()
print('Numeric cols:', num_cols)
print('Categorical cols:', cat_cols)
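
The split above passes `stratify=y`, which keeps class proportions approximately equal in the train and test sets. A quick way to see the effect is to compare the proportions on both sides of the split. This sketch uses a toy imbalanced target with made-up values; the names `toy`, `X_toy`, and `y_toy` are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 'Sunny' and 20 'Rainy' (made-up values)
toy = pd.DataFrame({
    'feature': range(100),
    'Weather': ['Sunny'] * 80 + ['Rainy'] * 20,
})
X_toy = toy.drop(columns=['Weather'])
y_toy = toy['Weather']

# Same split settings as the notebook cell
_, _, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# Class proportions in each split, side by side
train_share = y_tr_toy.value_counts(normalize=True)
test_share = y_te_toy.value_counts(normalize=True)
print(pd.DataFrame({'train': train_share, 'test': test_share}))
```

The same comparison can be run on `y_train` and `y_test` from the notebook to confirm that rare weather classes are represented in the test set.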

## Modeling — Decision Trees

- Train Decision Tree models with multiple `max_depth` values (1–9 and `None`, which leaves the depth unlimited). Record train and test accuracy for each configuration.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

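# Numeric features: mean imputation, then standardization
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
# Categorical features: most-frequent imputation, then one-hot encoding.
# `sparse_output` replaced `sparse` in scikit-learn 1.2; on older versions use sparse=False.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])
# Route each column group to its transformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, num_cols),
    ('cat', categorical_transformer, cat_cols),
])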

results = []
depths = list(range(1,10)) + [None]
for d in depths:
    clf = Pipeline(steps=[('pre', preprocessor),('clf', DecisionTreeClassifier(max_depth=d, random_state=42))])
    clf.fit(X_train, y_train)
    y_tr = clf.predict(X_train)
    y_te = clf.predict(X_test)
    results.append({'max_depth': d,'train_acc': accuracy_score(y_train, y_tr),'test_acc': accuracy_score(y_test, y_te)})

import pandas as pd
results_df = pd.DataFrame(results)
display(results_df)

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(8,5))
x_labels = [str(d) for d in depths]
plt.plot(x_labels, results_df['train_acc'], marker='o', label='Train accuracy')
plt.plot(x_labels, results_df['test_acc'], marker='o', label='Test accuracy')
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.title('Decision Tree: train vs test accuracy by max_depth')
plt.legend()
plt.grid(True)
plt.show()

## Analysis & Conclusion

- Compare how model depth affects overfitting/generalization.
- State which depth performs best and why.

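See the `test_acc` column of `results_df` to determine the best depth. A shallow tree may underfit, scoring low on both the train and test sets. A very deep tree may overfit, typically scoring near-perfect on the train set and noticeably lower on the test set. Choose a depth with high test accuracy and a small gap to train accuracy.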
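
The selection rule described above can be written as a small helper. This is a sketch of one reasonable rule, not the only one: take the highest test accuracy and break ties by the smaller train-test gap. The function name `pick_depth` is hypothetical, and the data comes from `make_classification` as a stand-in so the block runs on its own; it is not the weather dataset, and its numbers say nothing about this notebook's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data so the sketch is self-contained (not the weather dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Same depth sweep as the modeling cell
rows = []
for d in list(range(1, 10)) + [None]:
    tree = DecisionTreeClassifier(max_depth=d, random_state=42).fit(Xtr, ytr)
    rows.append({
        'max_depth': str(d),  # store as text so None stays readable
        'train_acc': tree.score(Xtr, ytr),
        'test_acc': tree.score(Xte, yte),
    })
demo_results = pd.DataFrame(rows)

def pick_depth(results):
    """Return the row with the highest test accuracy; ties go to the smaller train-test gap."""
    ranked = results.assign(gap=results['train_acc'] - results['test_acc'])
    ranked = ranked.sort_values(['test_acc', 'gap'], ascending=[False, True])
    return ranked.iloc[0]

best = pick_depth(demo_results)
print(best)
```

Calling `pick_depth(results_df)` on the notebook's own results would apply the same rule there. Because the choice is made on a single test split, it can shift with a different `random_state`.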