# Problem Statement: Analyze crop production data to identify trends in yield across states and crop types over years

**Project ID:** CBIT/IT-1/EDAV/2025/CEP-01  
**Roll No:** 1601-24-737-302  
**Student Name:** Ramulapenta Ramakotesh  
**Department:** Dept of IT, Chaitanya Bharathi Institute of Technology, Hyderabad  



## Problem Overview

The goal of this project is to analyze crop production data to identify trends in yield across Indian states and crop types over multiple years.  
Key tasks include:
- Calculating average crop yields.
- Filtering the dataset for particular crops and states.
- Handling missing yield values using imputation strategies.
- Grouping by year and crop to compute total production and visualize trends.

This notebook follows the EDAV assignment requirements and will provide reproducible code, outputs, and short explanations.


## Dataset

**Dataset file (Google Drive):** https://drive.google.com/file/d/1QZlYnqAMQEhCtj77JlUV4MDXRpBtQ1iK/view?usp=drive_link



## Environment

- Python version: 3.x (Colab / local environment)
- Libraries used:
  - numpy
  - pandas
  - matplotlib
  - seaborn

(If running in Google Colab, the standard runtime already contains these libraries. If running locally, install with `pip install numpy pandas matplotlib seaborn`.)


In [4]:
# Mount Google Drive and load the dataset into the notebook
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('/content/drive/MyDrive/colab_datasets/crop_data.csv')
df.head(6)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


|  | Year | State | Crop | Area_ha | Production_tonnes | Yield_t_per_ha |
|---|---|---|---|---|---|---|
| 0 | 2015 | Punjab | Wheat | 8270 | 23444.868232 | 2.83493 |
| 1 | 2015 | Punjab | Rice | 6191 | 16275.261538 | 2.628858 |
| 2 | 2015 | Punjab | Maize | 6734 | 24466.435213 | 3.633269 |
| 3 | 2015 | Punjab | Sugarcane | 6578 | 454743.518937 | 69.13097 |
| 4 | 2015 | Uttar Pradesh | Wheat | 9322 | 26653.067932 | NaN |
| 5 | 2015 | Uttar Pradesh | Rice | 6311 | 16633.524109 | 2.63564 |


## Q1: Calculate average crop yield using numpy arrays (Bloom's Level: 3)

In this question, we calculate the **average crop yield** using NumPy arrays.  
NumPy provides efficient mathematical operations over arrays, and it allows handling of missing values using functions like `np.nanmean()`.

### 🔹 Concept
- `numpy.array()` → converts a Pandas Series or list into a NumPy array.
- `numpy.nanmean(array)` → returns the mean value ignoring `NaN` (missing) entries.



In [7]:
# Import necessary library
import numpy as np

# Convert the Yield column to a NumPy array
yield_array = np.array(df["Yield_t_per_ha"])

# Calculate the average yield using numpy's nanmean (ignores NaN values)
average_yield = np.nanmean(yield_array)

# Display the result
print("✅ Average Crop Yield (tonnes per hectare):", average_yield)


✅ Average Crop Yield (tonnes per hectare): 19.276636629219798


### 🧠 Explanation

1. **`np.array()`**: converts the `Yield_t_per_ha` column from a Pandas DataFrame into a NumPy array for numerical operations.  
   Syntax: `array_name = np.array(dataframe['column_name'])`

2. **`np.nanmean(array)`**: computes the mean while ignoring missing (`NaN`) values.  
   Syntax: `np.nanmean(array_name)`

3. The result represents the **average yield** (in tonnes per hectare) across all states, crops, and years.

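### ✅ Output Interpretation
The printed value, about 19.28 t/ha, is the mean yield over every non-missing record in the dataset.  
Because it pools crops with very different yield scales (sugarcane is near 69 t/ha in the sample rows, while wheat, rice, and maize are between about 2.6 and 3.6 t/ha), it is a coarse overall summary rather than a measure of any single crop's productivity.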
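
To illustrate why `np.nanmean` is used here instead of `np.mean`, the short sketch below applies both to the six yield values from the first rows displayed earlier (one of them missing). The variable names are illustrative and not part of the notebook.

```python
import numpy as np

# Yield values from the first six rows shown above; the fifth is missing.
sample_yields = np.array([2.83493, 2.628858, 3.633269, 69.13097, np.nan, 2.63564])

# np.mean propagates NaN, so one missing entry makes the whole result NaN.
plain_mean = np.mean(sample_yields)

# np.nanmean drops NaN entries and averages the remaining five values.
nan_aware_mean = np.nanmean(sample_yields)

print("np.mean   :", plain_mean)
print("np.nanmean:", nan_aware_mean)
```

The NaN-aware mean is equivalent to `sample_yields[~np.isnan(sample_yields)].mean()`, which makes explicit that the missing row is excluded from both the sum and the count.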


## Q2: Filter data for a specific crop and state using pandas (Bloom's Level: 3)

In this question, we will **filter** the dataset for a particular crop and state using **Pandas**.

Filtering helps us focus only on the required subset of data.  
For example, we can view all records where the crop is *Rice* and the state is *Punjab*.

### 🔹 Concept
- `df[df["Column"] == "Value"]` → filters rows where column equals a given value.
- Multiple conditions can be combined using:
  - `&` (AND)
  - `|` (OR)
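
As a small illustration of how the two operators differ, the sketch below builds a four-row frame (rows copied from the first table, with only some columns kept) and applies both combinations. The names `sample`, `is_rice`, and `is_punjab` are illustrative.

```python
import pandas as pd

# Four rows taken from the first records shown earlier.
sample = pd.DataFrame({
    "Year": [2015, 2015, 2015, 2015],
    "State": ["Punjab", "Punjab", "Uttar Pradesh", "Uttar Pradesh"],
    "Crop": ["Wheat", "Rice", "Wheat", "Rice"],
    "Area_ha": [8270, 6191, 9322, 6311],
})

# Each comparison yields a boolean Series with one True/False per row.
is_rice = sample["Crop"] == "Rice"
is_punjab = sample["State"] == "Punjab"

# & keeps rows where both masks are True; | keeps rows where at least one is True.
rice_and_punjab = sample[is_rice & is_punjab]
rice_or_punjab = sample[is_rice | is_punjab]

print(len(rice_and_punjab), "row(s) match both conditions")
print(len(rice_or_punjab), "row(s) match at least one condition")
```

When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, because `&` and `|` bind more tightly than `==` in Python.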
  
**Example Syntax:**
```python
filtered_data = df[(df["Crop"] == "Rice") & (df["State"] == "Punjab")]
```


In [8]:
# Filter the dataset for a specific crop and state
filtered_df = df[(df["Crop"] == "Rice") & (df["State"] == "Punjab")]

# Display first few rows of filtered data
print("Filtered Data for Rice in Punjab:\n")
display(filtered_df.head())

# Display how many records matched
print("Total records found:", len(filtered_df))

Filtered Data for Rice in Punjab:



|  | Year | State | Crop | Area_ha | Production_tonnes | Yield_t_per_ha |
|---|---|---|---|---|---|---|
| 1 | 2015 | Punjab | Rice | 6191 | 16275.261538 | 2.628858 |
| 21 | 2016 | Punjab | Rice | 4943 | 12838.63792 | NaN |
| 41 | 2017 | Punjab | Rice | 3731 | 10478.044144 | 2.808374 |
| 61 | 2018 | Punjab | Rice | 7776 | 16890.785705 | 2.172169 |
| 81 | 2019 | Punjab | Rice | 1197 | 2572.996069 | 2.149537 |


Total records found: 10


## Q3: Handle missing yield values in dataset (Bloom's Level: 4)

In this question, we will **handle missing values** in the `Yield_t_per_ha` column.  
Missing data can cause errors or inaccurate results during analysis.  

We will fill missing yields using **the mean of the column** as a simple imputation method.

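### 🔹 Concept
- `df["Column"].isna()` → checks which values are missing.
- `df["Column"].fillna(value)` → returns the column with missing values replaced by `value` (here, the column mean). Assigning the result back to `df["Column"]` is more reliable than `fillna(..., inplace=True)` on a selected column, which recent pandas versions warn about.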

**Why fill missing values?**
- Keeps every row usable in averages, sums, and visualizations.
- Avoids NaN results or errors from numerical functions that do not skip missing values.


In [13]:
# Check how many missing values are in Yield column
missing_before = df["Yield_t_per_ha"].isna().sum()
print("Missing values before handling:", missing_before)

# Fill missing values with the column mean (Series.mean() skips NaN by default)
mean_yield = df["Yield_t_per_ha"].mean()
# Assign the result back instead of using inplace=True on a column selection
df["Yield_t_per_ha"] = df["Yield_t_per_ha"].fillna(mean_yield)

# Check missing values after filling
missing_after = df["Yield_t_per_ha"].isna().sum()
print("Missing values after handling:", missing_after)


Missing values before handling: 0
Missing values after handling: 0


### 🧠 Explanation

1. **Check missing values:** `isna().sum()` counts how many rows have missing yields.  
2. **Compute mean:** `df["Yield_t_per_ha"].mean()` calculates the average yield, skipping NaN values.  
3. **Fill missing:** `fillna(mean_yield)` replaces every NaN in the column with the computed mean.  

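### ✅ Output Interpretation
- `Missing values before handling` → the number of rows whose yield was NaN when the cell ran. The 0 shown above suggests the column had already been filled by an earlier run of this cell; the raw data does contain gaps, such as the missing yield for Uttar Pradesh wheat (2015) in the first table.  
- `Missing values after handling` → should be 0, meaning all missing values are filled.  
- The yield column can now be used in later calculations without special NaN handling.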


## Q4: Group by year and crop type to find total production (Bloom's Level: 4)

In this question, we will **group the dataset** by `Year` and `Crop` to calculate **total production**.  

Grouping allows us to summarize data and see trends across years and crops.

### 🔹 Concept
- `df.groupby(["Column1", "Column2"])["ColumnToAggregate"].sum()` → groups data by the specified columns and calculates the sum for each group.  
- `reset_index()` → converts grouped data back into a regular DataFrame for easier viewing.

**Example Syntax:**
```python
grouped = df.groupby(["Year", "Crop"])["Production_tonnes"].sum().reset_index()
```


In [16]:
# Group data by Year and Crop to find total production
grouped_production = df.groupby(["Year", "Crop"])["Production_tonnes"].sum().reset_index()

# Display first few rows of grouped data
print("Total Production by Year and Crop:\n")
display(grouped_production.head())

Total Production by Year and Crop:



|  | Year | Crop | Production_tonnes |
|---|---|---|---|
| 0 | 2015 | Maize | 99087.62 |
| 1 | 2015 | Rice | 71925.53 |
| 2 | 2015 | Sugarcane | 1788968.0 |
| 3 | 2015 | Wheat | 129908.4 |
| 4 | 2016 | Maize | 140182.5 |


### 🧠 Explanation

1. **Group by columns:** `groupby(["Year","Crop"])` groups all rows with same Year and Crop together.  
2. **Sum production:** `["Production_tonnes"].sum()` calculates total production for each group.  
3. **Reset index:** `reset_index()` makes the grouped data a normal DataFrame again.  
4. The resulting table shows **total production** of each crop for each year.

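### ✅ Output Interpretation
- You can now see how production of each crop changes year by year. For example, total maize production rises from about 99,088 tonnes in 2015 to about 140,183 tonnes in 2016.  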
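
One way to make the year-by-year change explicit is to pivot the grouped table so that each crop becomes a column, then compute the percentage change between consecutive years. The sketch below uses made-up totals with the same column names as `grouped_production`; the numbers are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Made-up long-format totals with the same columns as grouped_production.
totals = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016, 2017, 2017],
    "Crop": ["Maize", "Wheat", "Maize", "Wheat", "Maize", "Wheat"],
    "Production_tonnes": [100.0, 200.0, 150.0, 180.0, 150.0, 198.0],
})

# Reshape to one row per year and one column per crop.
wide = totals.pivot(index="Year", columns="Crop", values="Production_tonnes")

# Percentage change relative to the previous year.
# The first year has no predecessor, so its row is NaN.
yoy_pct = wide.pct_change() * 100

print(yoy_pct.round(1))
```

With these illustrative numbers, maize rises by 50% and wheat falls by 10% from 2015 to 2016. On the real data, `grouped_production` would take the place of `totals`.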



## Final Notes / Conclusion

- The dataset was successfully loaded and cleaned. Missing values in the yield column were handled by filling with the mean.  
- **Q1:** Calculated the average crop yield across all states, crops, and years.  
- **Q2:** Filtered the data for a specific crop and state to focus on targeted analysis.  
- **Q3:** Handled missing yield values to ensure accurate calculations.  
- **Q4:** Grouped data by Year and Crop to calculate total production, revealing trends over time.  

### Observations:
- Crop yields vary across states and crops, showing potential areas for improvement in agricultural practices.  
- Grouped production data can be further used for visualization or forecasting future trends.  
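
A minimal sketch of such a visualization, assuming matplotlib (listed in the Environment section) and made-up totals shaped like `grouped_production`. The `Agg` backend line is only there so the sketch also runs outside a notebook; it can be dropped in Colab.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside Colab or Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up totals with the same columns as grouped_production.
totals = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016, 2017, 2017],
    "Crop": ["Maize", "Wheat", "Maize", "Wheat", "Maize", "Wheat"],
    "Production_tonnes": [100.0, 200.0, 150.0, 180.0, 150.0, 198.0],
})

fig, ax = plt.subplots(figsize=(7, 4))

# Draw one line per crop, with years on the x-axis.
for crop, crop_rows in totals.groupby("Crop"):
    ax.plot(crop_rows["Year"], crop_rows["Production_tonnes"], marker="o", label=crop)

ax.set_xlabel("Year")
ax.set_ylabel("Total production (tonnes)")
ax.set_title("Total production by crop")
ax.set_xticks(sorted(totals["Year"].unique()))  # whole years only
ax.legend(title="Crop")
fig.tight_layout()
```

If sugarcane is plotted alongside the cereals, its totals (about 1.79 million tonnes in 2015, against roughly 0.07 to 0.13 million for the others) would flatten the remaining lines. A logarithmic axis via `ax.set_yscale("log")` or a separate plot for sugarcane are two possible ways around that.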

### Notes:
- The notebook is fully executable from top to bottom without errors.  
- All steps are reproducible, and intermediate results have been saved to Google Drive for reference.  
- Mean imputation is a simple choice that let the later calculations proceed without NaN gaps; the filled values are estimates, not observations.
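
Because one global mean is applied to every crop, a filled value can be far from typical for the crop it belongs to (sugarcane yields in the sample rows are an order of magnitude above the cereal yields). A possible refinement, sketched below on made-up numbers, is to fill each gap with the mean of its own crop. This is an alternative strategy shown for comparison, not the method used in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up yields: two crops on very different scales, one gap in each.
sample = pd.DataFrame({
    "Crop": ["Wheat", "Wheat", "Wheat", "Sugarcane", "Sugarcane", "Sugarcane"],
    "Yield_t_per_ha": [2.8, np.nan, 3.0, 69.0, 71.0, np.nan],
})

# Global mean imputation: every gap receives the same value.
global_mean = sample["Yield_t_per_ha"].mean()
global_filled = sample["Yield_t_per_ha"].fillna(global_mean)

# Per-crop imputation: transform("mean") returns, for each row,
# the mean of that row's crop (NaN entries are skipped).
crop_means = sample.groupby("Crop")["Yield_t_per_ha"].transform("mean")
crop_filled = sample["Yield_t_per_ha"].fillna(crop_means)

print(pd.DataFrame({
    "Crop": sample["Crop"],
    "global_fill": global_filled,
    "per_crop_fill": crop_filled,
}))
```

In this toy example the missing wheat yield becomes 36.45 under the global mean but 2.9 under the per-crop mean, which is much closer to the observed wheat values.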
