
# **Mini Project: Agricultural Yield Prediction with Linear Regression**

## **Objective:**
Predict crop production, used here as a stand-in for yield, from agricultural factors such as cultivated area, location, crop type, and season, using Linear Regression.

---

## **Step 1: Load and Explore the Dataset**

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load dataset
url = "https://raw.githubusercontent.com/dsrscientist/dataset1/master/crop_production.csv"
df = pd.read_csv(url)

# Quick overview
print(df.head())
df.info()  # prints its summary directly and returns None, so no print() is needed
print(df.describe())
```

✅ **Insight:** Dataset contains information on state, district, crop, area, production, and season.
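Strictly speaking, yield is production per unit of area, while the dataset stores area and production as separate columns. If a true yield column is wanted, it could be derived as in the minimal sketch below. The toy values and the `yield_per_area` column name are assumptions made for illustration, not part of the dataset.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "crop": ["Rice", "Wheat", "Maize"],
    "area": [100.0, 50.0, 20.0],          # cultivated area
    "production": [250.0, 150.0, 30.0],   # total output
})

# Yield = production per unit of area (hypothetical column name)
toy["yield_per_area"] = toy["production"] / toy["area"]

print(toy)
```

On real data, rows with an area of zero would need to be filtered out before dividing.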

---

## **Step 2: Data Cleaning & Preprocessing**

```python
# Convert column names to lowercase for consistency
df.columns = df.columns.str.lower()

# Check for missing values
print(df.isnull().sum())

# Drop rows with missing values
df = df.dropna()

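# Encode categorical variables as integer codes.
# Note: the codes are arbitrary labels, but a linear model will treat them as ordered magnitudes.
categorical_cols = ['state_name', 'district_name', 'crop', 'season']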
df_encoded = df.copy()

for col in categorical_cols:
    df_encoded[col] = df_encoded[col].astype('category').cat.codes
```

✅ **Rows with missing values are dropped and categorical columns are integer-encoded.**
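Integer codes give each category an arbitrary number, and a linear model reads those numbers as real magnitudes (code 3 counts as "three times" code 1). One common alternative is one-hot encoding, sketched below on a toy frame with `pd.get_dummies`. The toy values are made up for illustration and this is not the encoding used in the rest of this notebook.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "crop": ["Rice", "Wheat", "Rice", "Maize"],
    "season": ["Kharif", "Rabi", "Kharif", "Kharif"],
    "area": [10.0, 20.0, 15.0, 5.0],
})

# One-hot encode: each category becomes its own 0/1 column.
# drop_first=True drops one level per column to avoid perfect
# collinearity with the intercept of a linear model.
toy_onehot = pd.get_dummies(
    toy, columns=["crop", "season"], drop_first=True, dtype=int
)

print(toy_onehot)
```

The trade-off is width: a column with many distinct values, such as district names, expands into one new column per value.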

---

## **Step 3: Exploratory Data Analysis (EDA)**

### **1️⃣ How is area related to production?**

```python
sns.scatterplot(x=df['area'], y=df['production'])
plt.title("Area vs Production")
plt.xlabel("Area (Hectares)")
plt.ylabel("Production (Metric Tons)")
plt.show()
```

✅ **Insight:** Larger farming areas are generally associated with higher production.
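A scatter plot gives a visual impression of the relationship; a correlation coefficient puts a number on it. The sketch below uses synthetic data (not the crop dataset) to show how the Pearson correlation could be computed with pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: production grows roughly linearly with area, plus noise
area = rng.uniform(1, 1000, size=200)
production = 2.5 * area + rng.normal(0, 100, size=200)
toy = pd.DataFrame({"area": area, "production": production})

# Pearson correlation: +1 is a perfect increasing linear relationship, 0 is none
r = toy["area"].corr(toy["production"])
print(f"Pearson r = {r:.3f}")
```

On the real data, the same call on `df['area']` and `df['production']` would give the corresponding figure.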

---


### **2️⃣ Which crops have the highest production?**

```python
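# Aggregate first: sns.barplot plots the mean per category by default, not the total
total_by_crop = df.groupby('crop')['production'].sum().sort_values(ascending=False)

plt.figure(figsize=(12,5))
sns.barplot(x=total_by_crop.index, y=total_by_crop.values)
plt.xticks(rotation=45, ha='right')
plt.ylabel("Total Production")
plt.title("Total Production by Crop Type")
plt.show()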
```

✅ **Insight:** Some crops have significantly higher total production than others.

---

## **Step 4: Feature Selection & Model Preparation**

```python
# Define features and target
X = df_encoded[['area', 'state_name', 'district_name', 'crop', 'season']]
y = df_encoded['production']

# Split into train & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)
```

✅ **Data split into training and testing sets.**

---

## **Step 5: Train a Linear Regression Model**

```python
# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")
```

✅ **The R-squared score is the share of variance in test-set production that the model explains: 1.0 is a perfect fit, and 0 is no better than always predicting the mean.**
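To make the metrics concrete, the sketch below computes R-squared and the root mean squared error (RMSE) by hand on four made-up values and checks the result against `r2_score`. RMSE is not reported above; it is included here as an assumption that an error in the same units as production is easier to read than MSE.

```python
import numpy as np
from sklearn.metrics import r2_score

# Four made-up observations and predictions (illustrative only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)          # 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 20.0
r2_manual = 1 - ss_res / ss_tot

# RMSE: square root of the mean squared error, in the units of the target
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(f"Manual R^2:  {r2_manual:.3f}")                # 0.925
print(f"sklearn R^2: {r2_score(y_true, y_hat):.3f}")  # 0.925
print(f"RMSE:        {rmse:.3f}")                     # 0.612
```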

---


## **Final Observations & Next Steps**

📌 **Area strongly influences crop production.**  
📌 **Linear regression provides a simple baseline that can be improved.**  
📌 **Consider adding soil type, weather, and irrigation data for better accuracy.**  

🔹 **Next Steps:**  
- Try **Polynomial Regression** to capture non-linearity.  
- Use **Random Forest** or **Gradient Boosting** for better predictions.  
- Test on unseen agricultural datasets.  
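
The first two next steps can be sketched as follows. This is a minimal comparison on synthetic, deliberately non-linear data, not on the crop dataset; the variable names (`X_toy`, `y_toy`, `candidates`) and the hyperparameters are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

# Synthetic target with a U-shaped (quadratic) dependence on one feature
X_toy = rng.uniform(0, 10, size=(400, 1))
y_toy = (X_toy[:, 0] - 5) ** 2 + rng.normal(0, 1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Three candidate models trained and scored on the same split
candidates = {
    "linear": LinearRegression(),
    "polynomial": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

scores = {}
for name, estimator in candidates.items():
    estimator.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, estimator.predict(X_te))
    print(f"{name:>14}: R^2 = {scores[name]:.3f}")
```

On data like this, a straight line cannot follow the curve, so the polynomial and tree-based models should score noticeably higher. Whether the same holds for the crop data has to be checked on its own test set.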