## **Kenya Regional Crop Yield Prediction Using Machine Learning**

**1. Business Understanding**

**Background**

Agriculture remains one of Kenya’s most critical economic sectors, contributing significantly to GDP, employment, food security, and rural livelihoods. However, crop production in Kenya is highly vulnerable to climate variability, and yields differ widely with input usage and from region to region.

Unpredictable rainfall patterns, temperature fluctuations, and evolving agricultural practices make traditional yield estimation unreliable. Policymakers and agricultural planners often rely on historical trends rather than predictive intelligence, limiting proactive decision-making.

This project applies machine learning techniques to forecast crop yields across major Kenyan regions using historical agricultural production, climate indicators, and pesticide usage data.

## **Problem Statement**

The objective of this project is to develop a supervised machine learning model capable of predicting regional crop yield in Kenya using historical production data, climate variables, and agricultural input usage.

**Objectives**

The project sets out to answer four questions:

1. How accurately can regional crop yields be predicted using historical climate and input variables?

2. Which regions are most sensitive to rainfall and temperature variability?

3. How do pesticide usage and harvested area influence yield outcomes?

4. Can yield prediction models provide early warning signals for potential food shortages?

## Cleaning Kenya Crop File

In [42]:
#loading the dataset
import pandas as pd 

df = pd.read_csv(r"C:\Users\Administrator\Desktop\phase 5\Regional_Crop_Yield_Prediction_Kenya\data\kenya_crops_only.csv")
df.head()

Unnamed: 0,Year,Item,Item Code (CPC),Area_Harvested_ha,Production_tonnes,Yield_hg_per_ha
0,1961,Apricots,1343.0,2.0,10.0,50000.0
1,1961,Avocados,1311.0,1100.0,16000.0,145455.0
2,1961,Bananas,1312.0,40000.0,400000.0,100000.0
3,1961,Barley,115.0,12666.0,13513.0,10669.0
4,1961,"Beans, dry",1701.0,115000.0,55000.0,4783.0


In [43]:
#inspecting the dataframe structure and checking for missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3658 entries, 0 to 3657
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Year               3658 non-null   int64  
 1   Item               3658 non-null   object 
 2   Item Code (CPC)    3658 non-null   float64
 3   Area_Harvested_ha  3377 non-null   float64
 4   Production_tonnes  3658 non-null   float64
 5   Yield_hg_per_ha    3317 non-null   float64
dtypes: float64(4), int64(1), object(1)
memory usage: 171.6+ KB
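
The non-null counts above can be turned into an explicit missing-value summary. The following is a minimal sketch on a toy frame that reuses the crop file's column names; the values are illustrative and the `missing` table is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with the crop file's numeric column names; values are illustrative only.
toy = pd.DataFrame({
    "Area_Harvested_ha": [2.0, np.nan, 40000.0, 12666.0],
    "Production_tonnes": [10.0, 16000.0, 400000.0, 13513.0],
    "Yield_hg_per_ha": [50000.0, np.nan, np.nan, 10669.0],
})

# Count and share of missing values per column.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
})
print(missing)
```

Applied to `df`, the same two expressions would show how much of each column is about to be imputed.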


In [44]:
# Fill missing values in the numeric columns with each column's mean.
# Per df.info() above, only Area_Harvested_ha and Yield_hg_per_ha have gaps.
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
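
A single column-wide mean mixes crops whose yields differ by an order of magnitude (compare barley and bananas in the first rows of the file). One possible refinement, sketched below on toy data, is to first derive a missing yield from production and area, then fall back to the mean of the same crop. This is an alternative under stated assumptions, not the method used in this notebook: the conversion factor assumes 1 tonne = 10,000 hg, which is consistent with the first displayed row (10 t over 2 ha gives 50,000 hg/ha), and the toy values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with the crop file's column names; values are illustrative only.
toy = pd.DataFrame({
    "Year": [1961, 1962, 1963, 1961, 1962],
    "Item": ["Barley", "Barley", "Barley", "Bananas", "Bananas"],
    "Area_Harvested_ha": [10.0, 20.0, np.nan, 4.0, np.nan],
    "Production_tonnes": [10.0, 30.0, 25.0, 40.0, 50.0],
    "Yield_hg_per_ha": [10000.0, np.nan, np.nan, 100000.0, np.nan],
})

HG_PER_TONNE = 10_000  # assumption: 1 tonne = 10,000 hectograms

# Step 1: where area and production are both known, derive the missing yield.
# A zero area would produce inf, so treat that as still missing.
derived = (toy["Production_tonnes"] / toy["Area_Harvested_ha"] * HG_PER_TONNE)
derived = derived.replace([np.inf, -np.inf], np.nan)
toy["Yield_hg_per_ha"] = toy["Yield_hg_per_ha"].fillna(derived)

# Step 2: fill whatever is still missing with the mean of the same crop (Item).
for col in ["Area_Harvested_ha", "Yield_hg_per_ha"]:
    crop_mean = toy.groupby("Item")[col].transform("mean")
    toy[col] = toy[col].fillna(crop_mean)

print(toy)
```

Crops with no observed value at all would remain missing after step 2 and would still need a global fallback or removal.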


In [45]:
#checking if missing values still exist
df.isnull().sum()


Year                 0
Item                 0
Item Code (CPC)      0
Area_Harvested_ha    0
Production_tonnes    0
Yield_hg_per_ha      0
dtype: int64

## Cleaning Pesticides File 

In [46]:
#loading and reading the dataset
pest_df = pd.read_csv(r"C:\Users\Administrator\Desktop\phase 5\Regional_Crop_Yield_Prediction_Kenya\data\environment_pesticides_e_all_data.csv", sep=";")
pest_df.head()


Unnamed: 0,Year,Area,Item,Element,Value,Unit,Area Code (M49),Note
0,2005,United Kingdom of Great Britain and Northern I...,Pesticides (total),Use per area of cropland,5.44,kg/ha,826-01-01,Estimated Value
1,2012,United Kingdom of Great Britain and Northern I...,Pesticides (total),Use per area of cropland,2.83,kg/ha,826-01-01,Estimated Value
2,2021,United Kingdom of Great Britain and Northern I...,Pesticides (total),Use per area of cropland,2.44,kg/ha,826-01-01,Estimated Value
3,1993,United Kingdom of Great Britain and Northern I...,Pesticides (total),Use per capita,0.56,kg/cap,826-01-01,Estimated Value
4,2002,United Kingdom of Great Britain and Northern I...,Pesticides (total),Use per capita,0.52,kg/cap,826-01-01,Estimated Value


In [47]:
#checking for dataset information
pest_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   Year             100004 non-null  int64  
 1   Area             100004 non-null  object 
 2   Item             100004 non-null  object 
 3   Element          100004 non-null  object 
 4   Value            100004 non-null  float64
 5   Unit             100004 non-null  object 
 6   Area Code (M49)  99674 non-null   object 
 7   Note             100004 non-null  object 
dtypes: float64(1), int64(1), object(6)
memory usage: 6.1+ MB


In [48]:
#listing the unique names in the Area column to confirm that Kenya is present
pest_df["Area"].unique()

array(['United Kingdom of Great Britain and Northern Ireland',
       'United Republic of Tanzania', 'United States of America',
       'Uruguay', 'USSR', 'Vanuatu', 'Venezuela (Bolivarian Republic of)',
       'Viet Nam', 'Wallis and Futuna Islands', 'Yemen', 'Zambia',
       'Zimbabwe', 'Philippines', 'Pitcairn', 'Poland', 'Portugal',
       'Qatar', 'Republic of Korea', 'Republic of Moldova',
       'Eastern Africa', 'Middle Africa', 'Southern Africa',
       'Western Africa', 'Americas', 'Northern America',
       'Central America', 'Caribbean', 'South America', 'Asia',
       'Central Asia', 'Cyprus', 'Czechia', 'Israel', 'Italy', 'Gambia',
       'Georgia', 'Germany', 'Czechoslovakia',
       "Democratic People's Republic of Korea",
       'Democratic Republic of the Congo', 'Denmark', 'Djibouti',
       'Iran (Islamic Republic of)', 'Iraq', 'Ireland', 'Cameroon',
       'Canada', 'Cayman Islands', 'Central African Republic', 'Bahamas',
       'Bahrain', 'France', 'Dominican Repu

In [49]:
#keeping only the rows for Kenya
kenya_pest = pest_df[pest_df["Area"] == "Kenya"]
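
Because the printed array of area names is cut off, Kenya's presence can also be confirmed programmatically before filtering. The sketch below uses a toy frame; the stray trailing space in one name is a hypothetical issue added for illustration, not something observed in the real file.

```python
import pandas as pd

# Toy stand-in for pest_df; values are illustrative only.
toy = pd.DataFrame({
    "Area": ["Kenya", "Kenya ", "Uruguay", "Eastern Africa"],
    "Value": [2124.0, 1665.4, 1.0, 2.0],
})

# Normalise whitespace so that "Kenya " still matches.
area = toy["Area"].str.strip()

# Explicit membership check instead of scanning the printed array by eye.
kenya_present = bool(area.eq("Kenya").any())

# Filter and reset the index in one step.
kenya_rows = toy[area == "Kenya"].reset_index(drop=True)
print(kenya_present, len(kenya_rows))
```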


In [50]:
#resetting the index
kenya_pest = kenya_pest.reset_index(drop=True)


In [51]:
#renaming columns for consistency
kenya_pest = kenya_pest.rename(columns={
    "Year": "year",
    "Value": "pesticide_usage"
})
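
The rename above leaves all eight columns in place. If the aim is also to keep only the columns needed downstream, a selection step could follow. This is a minimal sketch: the two rows are copied from the Kenya output in this notebook, and the choice of which columns count as relevant (`relevant_cols`) is an assumption.

```python
import pandas as pd

# Two rows in the pesticide file's layout, taken from the Kenya output.
toy_pest = pd.DataFrame({
    "Year": [1999, 2002],
    "Area": ["Kenya", "Kenya"],
    "Item": ["Pesticides (total)", "Pesticides (total)"],
    "Element": ["Agricultural Use", "Agricultural Use"],
    "Value": [2124.0, 1665.4],
    "Unit": ["t", "t"],
    "Area Code (M49)": ["404-01-01", "404-01-01"],
    "Note": ["Estimated using net trade", "Estimated using net trade"],
})

renamed = toy_pest.rename(columns={"Year": "year", "Value": "pesticide_usage"})

# Assumed set of relevant columns: Element and Unit are kept because
# pesticide_usage is only meaningful together with them.
relevant_cols = ["year", "Item", "Element", "Unit", "pesticide_usage"]
slim = renamed[relevant_cols]
print(slim)
```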


In [52]:
kenya_pest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   year             264 non-null    int64  
 1   Area             264 non-null    object 
 2   Item             264 non-null    object 
 3   Element          264 non-null    object 
 4   pesticide_usage  264 non-null    float64
 5   Unit             264 non-null    object 
 6   Area Code (M49)  264 non-null    object 
 7   Note             264 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 16.6+ KB


In [55]:
#pesticide_usage has no missing values (264 non-null above), so this mean fill is only a precaution
kenya_pest["pesticide_usage"] = kenya_pest["pesticide_usage"].fillna(
    kenya_pest["pesticide_usage"].mean()
)
kenya_pest.head()

Unnamed: 0,year,Area,Item,Element,pesticide_usage,Unit,Area Code (M49),Note
0,1999,Kenya,Pesticides (total),Agricultural Use,2124.0,t,404-01-01,Estimated using net trade
1,2002,Kenya,Pesticides (total),Agricultural Use,1665.4,t,404-01-01,Estimated using net trade
2,2006,Kenya,Pesticides (total),Agricultural Use,2192.58,t,404-01-01,Estimated using net trade
3,2008,Kenya,Pesticides (total),Agricultural Use,2157.53,t,404-01-01,Estimated using net trade
4,2014,Kenya,Pesticides (total),Agricultural Use,4660.68,t,404-01-01,Estimated using net trade
