# Problem 1: Predicting Average Temperature in India

This notebook implements the first problem statement: developing a predictive model to estimate the month-wise average temperature of India using Linear Regression.

### Task 1: Setup and Data Loading

First, we import the necessary libraries and load the dataset. We are using a dataset of monthly mean temperatures in India from 1901 to 2017.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Set plot style
sns.set(style="whitegrid")

In [None]:
# Load the dataset from the local CSV file provided.
file_path = r'D:\ml\LP-I\Temperatures of India.csv'  # raw string so backslashes are not read as escape sequences
df = pd.read_csv(file_path)

# Display the first few rows of the dataframe
print("First 5 rows of the dataset:")
df.head()

### Task 2: Data Pre-processing and Exploration

We will perform some basic pre-processing. As a first step, the goal is to predict the annual average temperature, using 'YEAR' as the only feature and the 'ANNUAL' column as the target. The code below assumes 'ANNUAL' is already present in the dataset. We also check for missing values and drop any rows where 'ANNUAL' is missing.
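
If the dataset had no 'ANNUAL' column, it could be derived from the monthly columns. The following is a minimal sketch on a toy table, assuming the monthly columns are named `JAN` through `DEC`; the helper `ensure_annual` and the toy values are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real CSV; the column names are assumed, not confirmed.
month_cols = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
              'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.uniform(18, 32, size=(3, 12)), columns=month_cols)
toy.insert(0, 'YEAR', [1901, 1902, 1903])


def ensure_annual(frame, months):
    """Return a copy with an 'ANNUAL' column, computed only if it is absent."""
    out = frame.copy()
    if 'ANNUAL' not in out.columns:
        # Unweighted mean of the monthly values in each row (NaNs are skipped).
        out['ANNUAL'] = out[months].mean(axis=1)
    return out


toy = ensure_annual(toy, month_cols)
print(toy[['YEAR', 'ANNUAL']])
```

An unweighted mean treats every month equally; weighting by the number of days per month would be a slightly different definition of the annual average.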

In [None]:
# Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())

# For simplicity, we will predict the annual average temperature based on the year.
# Let's drop rows with missing 'ANNUAL' temperature data.
df_cleaned = df.dropna(subset=['ANNUAL'])

# Verify that the rows have been dropped
print("\nMissing values after cleaning:")
print(df_cleaned.isnull().sum())

Now, let's define our features (X) and target (y).

In [None]:
# Define features (X) and target (y)
# We use 'YEAR' to predict the 'ANNUAL' temperature.
X = df_cleaned[['YEAR']] # Features must be a 2D array
y = df_cleaned['ANNUAL']   # Target variable

# Split the data into training and testing sets
# We'll use 80% for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Task 3: Apply Linear Regression

We will now create an instance of the Linear Regression model and train it using our training data.

In [None]:
# Create a Linear Regression model instance
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

print("Linear Regression model trained successfully!")
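
Once a linear model is fitted, its slope and intercept can be read directly from the estimator. The sketch below uses a synthetic series with an assumed trend of 0.01 °C per year (not the real dataset) to show how `coef_` and `intercept_` might be interpreted; the names `toy_model`, `years`, and `temps` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic series: an assumed 0.01 °C/year trend plus noise (not real data).
rng = np.random.default_rng(42)
years = np.arange(1901, 2018).reshape(-1, 1)
temps = 24.0 + 0.01 * (years.ravel() - 1901) + rng.normal(0, 0.2, len(years))

toy_model = LinearRegression().fit(years, temps)

slope = toy_model.coef_[0]        # change in °C per year
intercept = toy_model.intercept_  # extrapolated value at year 0, rarely meaningful on its own

print(f"Slope: {slope:.4f} °C per year ({slope * 100:.2f} °C per century)")
print(f"Intercept: {intercept:.2f}")

# Predicting outside the observed years assumes the linear trend continues.
pred_2030 = toy_model.predict(np.array([[2030]]))[0]
print(f"Prediction for 2030: {pred_2030:.2f} °C")
```

With the notebook's own `model`, the same attributes would give the fitted warming rate for the 'ANNUAL' series.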

### Task 4: Predict Temperature and Assess Performance

With the model trained, we can make predictions on our test data and then evaluate the model's performance using MSE, MAE, and R-squared.

In [None]:
# Make predictions on the test data
y_pred = model.predict(X_test)

# Assess model performance
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
r2 = metrics.r2_score(y_test, y_pred)

print("Model Performance Evaluation:")
print(f'Mean Absolute Error (MAE): {mae:.2f}')
print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'R-squared (R²): {r2:.2f}')
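
To make the three metrics concrete, the following sketch computes them by hand with NumPy on four made-up values. The arrays `y_true` and `y_hat` are hypothetical; the formulas are the standard definitions that `metrics.mean_absolute_error`, `metrics.mean_squared_error`, and `metrics.r2_score` implement.

```python
import numpy as np

# Hypothetical actual and predicted temperatures in °C (illustrative values).
y_true = np.array([24.1, 24.3, 24.0, 24.6])
y_hat = np.array([24.0, 24.4, 24.2, 24.5])

residuals = y_true - y_hat

mae_manual = np.mean(np.abs(residuals))         # average absolute error, in °C
mse_manual = np.mean(residuals ** 2)            # average squared error, in °C²
ss_res = np.sum(residuals ** 2)                 # variation left unexplained by the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot                 # fraction of variation explained

print(f"MAE: {mae_manual:.3f}, MSE: {mse_manual:.4f}, R²: {r2_manual:.3f}")
```

Note that R² on a test set can be negative when the model predicts worse than simply using the mean of the test targets.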

### Task 5: Visualize the Regression Model

Finally, we'll visualize the results. We will plot the actual temperature data points and overlay our linear regression model's prediction line to see how well it fits the data.

In [None]:
plt.figure(figsize=(12, 6))

# Scatter plot for the actual data points (test set)
plt.scatter(X_test, y_test, color='blue', label='Actual Temperature')

# Line plot for the predicted values
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted Temperature (Regression Line)')

plt.title('Temperature in India: Actual vs. Predicted')
plt.xlabel('Year')
plt.ylabel('Annual Average Temperature (°C)')
plt.legend()
plt.show()

### Predicting Temperature Trends Month-wise

The problem statement also asks for month-wise trend prediction. We can do this by fitting a separate model for each month.

In [None]:
months = df_cleaned.columns[1:13]  # JAN to DEC; assumes the monthly columns sit at positions 1-12
monthly_models = {}
monthly_performance = {}

for month in months:
    # Create a dataframe for the specific month, dropping NaNs
    month_df = df_cleaned[['YEAR', month]].dropna()
    
    # Define features and target
    X_month = month_df[['YEAR']]
    y_month = month_df[month]
    
    # Split data
    X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_month, y_month, test_size=0.2, random_state=42)
    
    # Create and train the model
    model_month = LinearRegression()
    model_month.fit(X_train_m, y_train_m)
    monthly_models[month] = model_month
    
    # Evaluate performance
    y_pred_m = model_month.predict(X_test_m)
    r2_m = metrics.r2_score(y_test_m, y_pred_m)
    monthly_performance[month] = r2_m
    
    print(f"Model for {month} trained. R-squared: {r2_m:.2f}")
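
The per-month results are easier to compare in a single table. The sketch below shows one possible way to do that on synthetic data with three months and assumed trends; the `summary` table and its column names are hypothetical, and the R² here is computed on the full toy data rather than on a held-out test set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data standing in for the real dataset: three months with assumed trends.
rng = np.random.default_rng(1)
years = np.arange(1901, 2018)
toy = pd.DataFrame({'YEAR': years})
assumed_trends = {'JAN': 0.012, 'FEB': 0.008, 'MAR': 0.004}  # °C per year
for name, trend in assumed_trends.items():
    toy[name] = 20 + trend * (years - 1901) + rng.normal(0, 0.3, len(years))

rows = []
for name in assumed_trends:
    fitted = LinearRegression().fit(toy[['YEAR']], toy[name])
    rows.append({
        'month': name,
        'slope_per_century': fitted.coef_[0] * 100,              # °C per 100 years
        'r2_full_data': fitted.score(toy[['YEAR']], toy[name]),  # in-sample R²
    })

summary = pd.DataFrame(rows).set_index('month')
print(summary.sort_values('slope_per_century', ascending=False).round(3))
```

Applied to the notebook's `monthly_models` dictionary, the same idea would collect `coef_[0]` for each month alongside the test-set values in `monthly_performance`.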

### Visualize Monthly Trends

Let's visualize the regression line for each of the twelve months to see the trend.

In [None]:
fig, axes = plt.subplots(4, 3, figsize=(20, 18), sharey=True)
axes = axes.flatten()

for i, month in enumerate(months):
    ax = axes[i]
    
    # Get the data for the month
    month_df = df_cleaned[['YEAR', month]].dropna()
    X_month = month_df[['YEAR']]
    y_month = month_df[month]
    
    # Get the trained model
    model_month = monthly_models[month]
    
    # Plot actual data
    ax.scatter(X_month, y_month, alpha=0.5, label='Actual Data')
    
    # Plot regression line
    y_pred_line = model_month.predict(X_month)
    ax.plot(X_month, y_pred_line, color='red', label='Regression Line')
    
    ax.set_title(f'Temperature Trend for {month}')
    ax.set_xlabel('Year')
    ax.set_ylabel('Temperature (°C)')
    ax.legend()

plt.tight_layout()
plt.show()

### Conclusion

We have implemented simple linear regression models that estimate both annual and month-wise temperature trends in India, using the year as the only feature.

**Code Quality and Clarity:**
- The code is organized into logical tasks as specified in the problem statement.
- Comments are used to explain each step, from data loading to visualization.
- Variable names are descriptive (`X_train`, `y_pred`, `model_month`).
- The implementation uses standard, well-regarded libraries (`pandas`, `scikit-learn`, `matplotlib`).

**Potential Improvements:**
- **Feature Engineering:** Instead of using only 'YEAR', we could add derived features. Polynomial terms of 'YEAR', for example, might capture non-linear trends more effectively. Time-series-specific models (like ARIMA) are a separate alternative to regression on the year.
- **More Complex Models:** For a more accurate prediction, one could use more advanced regression models like Random Forest Regressor or Gradient Boosting Regressor.
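- **Month as a Feature:** A single model could be built to predict temperature by using both 'YEAR' and 'MONTH' as features. Because the dataset stores each month in its own column, this would require reshaping it to one row per year-month pair and then encoding the new 'MONTH' column (e.g., one-hot encoding).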
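
The month-as-a-feature idea can be sketched as follows. This is a minimal illustration on synthetic data with only three months, assuming a wide table with one column per month; the names `wide`, `long_df`, `TEMP`, and `combined` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy wide-format table (one column per month); all values are synthetic.
rng = np.random.default_rng(7)
month_cols = ['JAN', 'FEB', 'MAR']
years = np.arange(1901, 1921)
wide = pd.DataFrame({'YEAR': years})
for offset, name in enumerate(month_cols):
    wide[name] = 18 + 3 * offset + 0.01 * (years - 1901) + rng.normal(0, 0.2, len(years))

# Reshape to long format: one row per (YEAR, MONTH) pair.
long_df = wide.melt(id_vars='YEAR', value_vars=month_cols,
                    var_name='MONTH', value_name='TEMP')

# One-hot encode MONTH; drop_first removes one column that would be
# redundant alongside the model's intercept.
X_long = pd.get_dummies(long_df[['YEAR', 'MONTH']], columns=['MONTH'],
                        drop_first=True, dtype=float)
y_long = long_df['TEMP']

combined = LinearRegression().fit(X_long, y_long)
print(dict(zip(X_long.columns, combined.coef_.round(3))))
```

A model built this way shares one 'YEAR' slope across all months and gives each month only its own offset, so it is less flexible than the twelve separate models unless interaction terms are added.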