In [2]:
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Read the CSV file into a DataFrame
df = pd.read_csv('DATA/kenyan_projects_dataset.csv')

# Display the first 5 rows
df.head()



Unnamed: 0,Settlement Name,Project Type,Project Completion Status,Completion Percentage,Start Date,End Date,Project Budget (KES),Requirement,Population,Implementing Partner,Type of Settlement
0,Marsabit,School Construction,Completed,100,2018-11-27,2019-03-24,252770.964794,,77820,ActionAid,Rural
1,Kwale,Community Center,Ongoing,39,2020-03-16,,91111.210919,,42090,ActionAid,Semi-Rural
2,Migori,Water Supply,Completed,100,2019-04-20,2019-07-17,61064.558407,,29693,World Vision,Rural
3,Wajir,Community Center,Completed,100,2021-02-28,2022-01-30,418403.070343,,68435,World Bank,Rural
4,Migori,Water Supply,Ongoing,18,2021-10-01,,847017.427612,,88313,Save the Children,Rural


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Settlement Name            5000 non-null   object 
 1   Project Type               5000 non-null   object 
 2   Project Completion Status  5000 non-null   object 
 3   Completion Percentage      5000 non-null   int64  
 4   Start Date                 5000 non-null   object 
 5   End Date                   4684 non-null   object 
 6   Project Budget (KES)       5000 non-null   float64
 7   Requirement                0 non-null      float64
 8   Population                 5000 non-null   int64  
 9   Implementing Partner       5000 non-null   object 
 10  Type of Settlement         5000 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 429.8+ KB


In [4]:
df['Start Date'] = pd.to_datetime(df['Start Date'])
df['End Date'] = pd.to_datetime(df['End Date'], errors='coerce')  # Missing or unparseable end dates (e.g. ongoing projects) become NaT

In [5]:
df['Project Completion Status'] = df['Project Completion Status'].map({'Ongoing': 0, 'Completed': 1})

In [6]:
df.head()

Unnamed: 0,Settlement Name,Project Type,Project Completion Status,Completion Percentage,Start Date,End Date,Project Budget (KES),Requirement,Population,Implementing Partner,Type of Settlement
0,Marsabit,School Construction,1,100,2018-11-27,2019-03-24,252770.964794,,77820,ActionAid,Rural
1,Kwale,Community Center,0,39,2020-03-16,NaT,91111.210919,,42090,ActionAid,Semi-Rural
2,Migori,Water Supply,1,100,2019-04-20,2019-07-17,61064.558407,,29693,World Vision,Rural
3,Wajir,Community Center,1,100,2021-02-28,2022-01-30,418403.070343,,68435,World Bank,Rural
4,Migori,Water Supply,0,18,2021-10-01,NaT,847017.427612,,88313,Save the Children,Rural


In [7]:
# Filter the dataframe to include only numeric columns
numeric_df = df.select_dtypes(include='number')

# Recompute the correlation matrix
correlation_matrix = numeric_df.corr()

# Print the correlation matrix
print("\nCorrelation Matrix (Numeric Columns):\n", correlation_matrix.to_markdown(numalign="left", stralign="left"))


Correlation Matrix (Numeric Columns):
 |                           | Project Completion Status   | Completion Percentage   | Project Budget (KES)   | Requirement   | Population   |
|:--------------------------|:----------------------------|:------------------------|:-----------------------|:--------------|:-------------|
| Project Completion Status | 1                           | 0.905603                | 0.073441               | nan           | 0.09647      |
| Completion Percentage     | 0.905603                    | 1                       | 0.0632766              | nan           | 0.0871462    |
| Project Budget (KES)      | 0.073441                    | 0.0632766               | 1                      | nan           | 0.721682     |
| Requirement               | nan                         | nan                     | nan                    | nan           | nan          |
| Population                | 0.09647                     | 0.0871462               | 0.721682               | nan           | 1            |

### The analysis of the dataset reveals the following major correlations and trends:

- **Weak positive correlation between Population and Project Completion Status (r ≈ 0.10):** Projects in areas with larger populations are only marginally more likely to be completed. The association is too weak to support conclusions about causes such as greater visibility, increased resource allocation, or stronger community involvement in densely populated regions.

- **Strong positive correlation between Project Budget (KES) and Population (r ≈ 0.72):** Projects serving larger populations tend to receive larger budgets. This is plausible, as more resources are required to address the needs of a larger community.

- **No strong association between Project Completion Status and Implementing Partner:** Implementing Partner is a categorical column, so it does not appear in the numeric correlation matrix above. The data doesn't show a strong association between the completion rate and the specific implementing partner. This suggests that completion rates are influenced more by other factors such as project type, location, or available resources than by the organization responsible for implementation.

- **'Oxfam' has the highest Completion Percentage:** Among the implementing partners, 'Oxfam' exhibits the highest average completion percentage, which may reflect effective project execution.

- **The 'Requirement' column in the dataset appears to be empty and therefore couldn't be included in the analysis.** If this column is populated with relevant data in the future, it could offer additional insights into the factors influencing project outcomes.
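To put the strength of these correlations in perspective, here is a minimal sketch that squares Pearson's r, which gives the share of variance a linear relationship accounts for. The two r values are copied from the correlation matrix above; the variable names are illustrative only.

```python
# Pearson r values copied from the correlation matrix above
correlations = {
    "Population vs Project Completion Status": 0.09647,
    "Project Budget (KES) vs Population": 0.721682,
}

# r squared: share of variance accounted for by a linear relationship
r_squared = {pair: r ** 2 for pair, r in correlations.items()}

for pair, r2 in r_squared.items():
    print(f"{pair}: r = {correlations[pair]:.3f}, r^2 = {r2:.3f}")
```

By this measure, population accounts for roughly 1% of the variation in completion status but roughly 52% of the variation in budget.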

Overall, the dataset provides valuable information about the relationships between various factors involved in project implementation in Kenya. These insights can be leveraged to inform decision-making, resource allocation, and strategic planning for future development initiatives.

We also observed the following trends related to categorical columns:

- **Relationship between Type of Settlement and Project Budget (KES):** 'Urban' settlements receive the highest average project budget, followed by 'Semi-Urban'; 'Rural' and 'Semi-Rural' settlements come last, with comparable budget allocations. This disparity may reflect the higher costs and complexities associated with implementing projects in urban areas.
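Statements about categorical columns such as Implementing Partner and Type of Settlement cannot be read off the numeric correlation matrix; one way to check them is a grouped mean. The sketch below uses a tiny made-up `sample` frame that only borrows the dataset's column names, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Tiny hypothetical sample; only the column names match the dataset
sample = pd.DataFrame({
    "Implementing Partner": ["Oxfam", "Oxfam", "ActionAid",
                             "ActionAid", "World Vision", "World Vision"],
    "Type of Settlement": ["Urban", "Rural", "Urban",
                           "Semi-Rural", "Rural", "Semi-Urban"],
    "Completion Percentage": [100, 80, 100, 39, 100, 60],
    "Project Budget (KES)": [900000.0, 250000.0, 800000.0,
                             90000.0, 60000.0, 400000.0],
})

# Mean completion percentage per partner, with the group size for context
partner_completion = (
    sample.groupby("Implementing Partner")["Completion Percentage"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

# Mean budget per settlement type, highest first
budget_by_type = (
    sample.groupby("Type of Settlement")["Project Budget (KES)"]
    .mean()
    .sort_values(ascending=False)
)

print(partner_completion)
print(budget_by_type)
```

Reporting the group size next to each mean helps judge whether a difference between groups rests on enough projects to be meaningful.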



The columns Project Type, Implementing Partner, and Type of Settlement are categorical and need to be encoded before they can be used in our models. Start Date and End Date are datetime columns, but End Date has null values (only 4,684 of 5,000 rows are non-null), so we drop both for this analysis. The target variable is Project Completion Status (0 = Ongoing, 1 = Completed). We will build classification models to predict this target variable.

The dataset records only 'Ongoing' and 'Completed' statuses, so it cannot support a separate is_delayed target for predicting the likelihood of project delays.

We will also leave out the Settlement Name column, as it is not used as a feature in the models below.
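The encoding step mentioned above could be sketched with one-hot encoding, which avoids implying an order among categories the way integer codes can. This is only an illustration under assumptions: the `sample` frame reuses three rows' worth of values from the `df.head()` output, and `pd.get_dummies` is one option among several, not necessarily the approach used in the cells that follow.

```python
import pandas as pd

# Hypothetical mini-frame using values seen in df.head() above
sample = pd.DataFrame({
    "Project Type": ["Water Supply", "Community Center", "School Construction"],
    "Type of Settlement": ["Rural", "Semi-Rural", "Rural"],
    "Population": [29693, 42090, 77820],
})

# One 0/1 indicator column per category value
encoded = pd.get_dummies(
    sample,
    columns=["Project Type", "Type of Settlement"],
    dtype=int,
)

print(encoded.columns.tolist())
```

Each category becomes its own indicator column, so a linear model does not treat, say, 'Semi-Rural' as numerically "greater" than 'Rural'.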

In [8]:
import altair as alt
alt.renderers.enable('default') 

%matplotlib inline
# Group by `Settlement Name` and `Project Type`, count occurrences, and unstack
project_counts = df.groupby(['Settlement Name', 'Project Type']).size().unstack(fill_value=0)

# Melt the DataFrame to long format for Altair
melted_df = project_counts.reset_index().melt('Settlement Name', var_name='Project Type', value_name='Count')

# Create the chart
chart = alt.Chart(melted_df).mark_bar().encode(
    x=alt.X('Settlement Name:N', axis=alt.Axis(labelAngle=-45)),  # Rotate x-axis labels
    y=alt.Y('Count:Q', title='Number of Projects'),
    color='Project Type:N',
    tooltip=['Settlement Name', 'Project Type', 'Count']
).properties(
    title='Project Types by Settlement'
).interactive()
# Display the chart
chart


In [9]:
# 1. Group by `Settlement Name` and sum `Project Budget (KES)`
df_result = df.groupby('Settlement Name')['Project Budget (KES)'].sum().reset_index()

# 2. Sort the dataframe in descending order of the sum
df_result = df_result.sort_values(by='Project Budget (KES)', ascending=False)

# 3. Display the first 5 rows of the sorted dataframe
print(df_result.head().to_markdown(index=False, numalign="left", stralign="left"))

# 4. Create a bar chart
chart = alt.Chart(df_result).mark_bar().encode(
    x=alt.X('Settlement Name:N', axis=alt.Axis(labelAngle=-45)),
    y=alt.Y('Project Budget (KES):Q', title='Total Project Cost (KES)'),
    tooltip = ['Settlement Name', 'Project Budget (KES)']
).properties(
    title='Total Project Cost by Settlement'
).interactive()
chart

| Settlement Name   | Project Budget (KES)   |
|:------------------|:-----------------------|
| Mombasa           | 9.5038e+07             |
| Kisumu            | 8.83736e+07            |
| Nairobi           | 6.86565e+07            |
| Kitale            | 6.48978e+07            |
| Thika             | 5.58152e+07            |


In [10]:
# 1. Group by `Settlement Name` and sum `Project Budget (KES)` and `Population`
df_result = df.groupby('Settlement Name')[['Project Budget (KES)', 'Population']].sum().reset_index()

# 2. Create a scatter plot
chart = alt.Chart(df_result).mark_circle().encode(
    x=alt.X('Population:Q', title='Population'),
    y=alt.Y('Project Budget (KES):Q', title='Total Project Budget (KES)'),
    tooltip = ['Settlement Name', 'Population', 'Project Budget (KES)']
)

# 3. Add labels to the points
text = chart.mark_text(
    align='left',
    baseline='middle',
    dx=5,  # Nudge labels to the right
    dy=-5  # Nudge labels upwards
).encode(
    text='Settlement Name'
)

# 4. Combine points and labels, add title
final_chart = (chart + text).properties(
    title='Population vs Total Project Budget by Settlement'
).interactive()
final_chart

In [11]:
# 1. Group by `Settlement Name` and sum `Project Budget (KES)` and `Population`
df_result = df.groupby('Settlement Name')[['Project Budget (KES)', 'Population']].sum().reset_index()

# 2. Divide `Project Budget (KES)` by `Population` to get the per capita spending
df_result['Per Capita Spending'] = df_result['Project Budget (KES)'] / df_result['Population']

# 3. Sort the dataframe in descending order of the per capita spending
df_result = df_result.sort_values(by='Per Capita Spending', ascending=False)

# 4. Display the first 5 rows of the sorted dataframe
print(df_result.head().to_markdown(index=False, numalign="left", stralign="left"))

# 5. Create a bar chart with `Settlement Name` on x-axis and per capita spending on y-axis
chart = alt.Chart(df_result).mark_bar().encode(
    x=alt.X('Settlement Name:N', axis=alt.Axis(labelAngle=-45)),  # 6. Keep the X-labels at 45 degree angle
    y=alt.Y('Per Capita Spending:Q', title='Average Amount Spent Per Capita (KES)'),
    tooltip = ['Settlement Name', 'Per Capita Spending']
).properties(
    title='Average Amount of Money Spent on Projects Per Capita for Each Settlement'  # 7. Add title to the chart
).interactive()
chart

| Settlement Name   | Project Budget (KES)   | Population   | Per Capita Spending   |
|:------------------|:-----------------------|:-------------|:----------------------|
| Mombasa           | 9.5038e+07             | 6353638      | 14.958                |
| Nairobi           | 6.86565e+07            | 4719751      | 14.5466               |
| Kisumu            | 8.83736e+07            | 6079418      | 14.5365               |
| Kitale            | 6.48978e+07            | 6324398      | 10.2615               |
| Malindi           | 5.56374e+07            | 5621043      | 9.89805               |


In [12]:
df_result = df.groupby('Settlement Name')[['Project Budget (KES)', 'Population']].sum().reset_index()

# 3. Divide `Project Budget (KES)` by `Population` to get the per capita spending
df_result['Per Capita Spending'] = df_result['Project Budget (KES)'] / df_result['Population']

# 4. Create a scatter plot
chart = alt.Chart(df_result).mark_circle().encode(
    x=alt.X('Population:Q', title='Population'),
    y=alt.Y('Per Capita Spending:Q', title='Per Capita Project Budget (KES)'),
    tooltip = ['Settlement Name', 'Population', 'Per Capita Spending']
)

# 5. Add labels to the points
text = chart.mark_text(
    align='left',
    baseline='middle',
    dx=5,  # Nudge labels to the right
    dy=-5  # Nudge labels upwards
).encode(
    text='Settlement Name'
)

# 6. Combine points and labels, add title
final_chart = (chart + text).properties(
    title='Population vs Per Capita Project Budget by Settlement'
).interactive()

final_chart

In [13]:
# Group by `Settlement Name` and `Project Completion Status`, count occurrences, and unstack
project_counts = df.groupby(['Settlement Name', 'Project Completion Status']).size().unstack(fill_value=0)

# Melt the DataFrame to long format for Altair
melted_df = project_counts.reset_index().melt('Settlement Name', var_name='Project Completion Status', value_name='Count')

# Create the chart
chart = alt.Chart(melted_df).mark_bar().encode(
    x=alt.X('Settlement Name:N', axis=alt.Axis(labelAngle=-45)),  # Rotate x-axis labels
    y=alt.Y('Count:Q', title='Number of Projects'),
    color='Project Completion Status:N',
    tooltip=['Settlement Name', 'Project Completion Status', 'Count']
).properties(
    title='Project Completion Status by Settlement'
).interactive()

# Enable the default renderer
alt.renderers.enable('default')

# Display chart
chart

In [14]:

# 2. Group by `Settlement Name` and calculate the mean of `Project Completion Status`
df_result = df.groupby('Settlement Name')['Project Completion Status'].mean().reset_index()

# 3. Sort the dataframe in descending order of the mean completion status
df_result = df_result.sort_values(by='Project Completion Status', ascending=False)

# 4. Display the first 5 rows of the sorted dataframe
print(df_result.head().to_markdown(index=False, numalign="left", stralign="left"))

# 5. Create a bar chart with `Settlement Name` on x-axis and mean completion status on y-axis
chart = alt.Chart(df_result).mark_bar().encode(
    x=alt.X('Settlement Name:N', axis=alt.Axis(labelAngle=-45)),  # 6. Keep the X-labels at 45 degree angle
    y=alt.Y('Project Completion Status:Q', title='Mean Project Completion Status'),
    tooltip = ['Settlement Name', 'Project Completion Status']
).properties(
    title='Mean Project Completion Status by Settlement'  # 7. Add title to the chart
).interactive()
chart

| Settlement Name   | Project Completion Status   |
|:------------------|:----------------------------|
| Kitale            | 0.981982                    |
| Busia             | 0.971154                    |
| Laikipia          | 0.967742                    |
| Kajiado           | 0.964286                    |
| Kericho           | 0.96                        |


### 1. Predicting Project Completion Status (Classification)

Possible Models:

- Logistic Regression
- Decision Tree
- Random Forest
- Gradient Boosting (e.g., XGBoost, LightGBM)
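One way to compare the candidate models listed above on equal footing is cross-validation with a metric that is not dominated by the majority class. The sketch below is illustrative only: it runs on synthetic data from `make_classification` rather than the projects dataset, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost or LightGBM.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (not the Kenyan projects dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.2, 0.8], random_state=42,
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# 5-fold cross-validated balanced accuracy for each candidate
cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(
        estimator, X_demo, y_demo, cv=5, scoring="balanced_accuracy"
    )
    cv_scores[name] = scores.mean()
    print(f"{name}: balanced accuracy = {scores.mean():.3f}")
```

Balanced accuracy averages the recall of each class, so a model that always predicts the majority class scores 0.5 rather than the majority share.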

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# Preprocessing
# 'Project Completion Status' was already mapped to 0/1 in cell In [5]

# Encode 'Type of Settlement' and 'Implementing Partner'
le = LabelEncoder()
df['Type of Settlement'] = le.fit_transform(df['Type of Settlement'])
df['Implementing Partner'] = le.fit_transform(df['Implementing Partner'])

# Select features and target
X = df[['Population', 'Project Budget (KES)', 'Type of Settlement', 'Implementing Partner']]
y = df['Project Completion Status']

# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Model training
model = LogisticRegression()
model.fit(X_train, y_train)

# Model evaluation
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:\n", report)

Accuracy: 0.925
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00        75
           1       0.93      1.00      0.96       925

    accuracy                           0.93      1000
   macro avg       0.46      0.50      0.48      1000
weighted avg       0.86      0.93      0.89      1000



  _warn_prf(average, modifier, msg_start, len(result))


An accuracy of 0.925 means that the model correctly classified 92.5% of the observations in the testing set. That figure is misleading here: 925 of the 1,000 test projects are completed, and the classification report shows a precision and recall of 0.00 for class 0 (ongoing). The model predicted "completed" for every test project, so its accuracy equals the majority-class baseline and it does not distinguish between completed and ongoing projects at all.

This is why accuracy is just one metric for evaluating model performance. Other metrics, such as precision, recall, and F1-score, are more informative for this problem.

What does this mean in the context of your Kenyan projects dataset?

Out of every 100 projects the Logistic Regression was asked to classify, it got about 92 or 93 correct, but only because about 92 or 93 of every 100 projects are completed. It identified none of the 75 ongoing projects in the test set.

Why is it important to consider other metrics besides accuracy?

- **Imbalanced datasets:** If your dataset has a significantly higher number of completed projects than ongoing ones (or vice versa), accuracy can be misleading. A model can achieve high accuracy by simply predicting the majority class every time, which is what happened here.
- **Different types of errors:** In some cases, it might be more important to minimize false positives (predicting a project is completed when it's not) or false negatives (predicting a project is ongoing when it's completed). Precision and recall are better suited to evaluating these specific types of errors.

Overall, an accuracy of 0.925 is no better than always predicting the majority class. It's crucial to examine per-class metrics and consider the context of your problem to get a complete picture of your model's performance.
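As a hedged illustration of how class imbalance might be handled, the sketch below reweights the classes with `class_weight="balanced"`, keeps the class ratio in both splits with `stratify`, and reports a confusion matrix and balanced accuracy instead of plain accuracy. It runs on synthetic data with roughly the same 8%/92% class split, not on the projects dataset, and the pipeline (which also fits the scaler on the training split only) is an assumption about one reasonable setup rather than the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with about 8% minority class (label 0)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.08, 0.92], random_state=42,
)

# Stratify so both splits keep the same class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The pipeline fits the scaler on the training data only
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_hat)
bal_acc = balanced_accuracy_score(y_te, y_hat)

print(cm)
print(f"Balanced accuracy: {bal_acc:.3f}")
```

Reweighting usually trades some overall accuracy for better recall on the minority class; whether that trade is worthwhile depends on the cost of missing an ongoing project.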

### Decision Tree

In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report

# 2. Preprocessing
# 'Type of Settlement' and 'Implementing Partner' were label-encoded in the
# previous cell; re-encoding the integer codes leaves them unchanged
le = LabelEncoder()
df['Type of Settlement'] = le.fit_transform(df['Type of Settlement'])
df['Implementing Partner'] = le.fit_transform(df['Implementing Partner'])

# 'Project Completion Status' is already numeric (0 = Ongoing, 1 = Completed)

# Split into features and target
X = df[['Population', 'Project Budget (KES)', 'Type of Settlement', 'Implementing Partner']]
y = df['Project Completion Status']

# 3. Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Model training
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# 5. Make predictions
y_pred = model.predict(X_test)

# 6. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print("Classification Report:\n", report)

Accuracy: 0.882
Classification Report:
               precision    recall  f1-score   support

           0       0.14      0.11      0.12        75
           1       0.93      0.94      0.94       925

    accuracy                           0.88      1000
   macro avg       0.53      0.53      0.53      1000
weighted avg       0.87      0.88      0.88      1000



In [17]:
df.head()

Unnamed: 0,Settlement Name,Project Type,Project Completion Status,Completion Percentage,Start Date,End Date,Project Budget (KES),Requirement,Population,Implementing Partner,Type of Settlement
0,Marsabit,School Construction,1,100,2018-11-27,2019-03-24,252770.964794,,77820,1,0
1,Kwale,Community Center,0,39,2020-03-16,NaT,91111.210919,,42090,1,1
2,Migori,Water Supply,1,100,2019-04-20,2019-07-17,61064.558407,,29693,9,0
3,Wajir,Community Center,1,100,2021-02-28,2022-01-30,418403.070343,,68435,8,0
4,Migori,Water Supply,0,18,2021-10-01,NaT,847017.427612,,88313,5,0


In [18]:
df_tan.head()

NameError: name 'df_tan' is not defined

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Read the settlement CSV file into a DataFrame (this replaces the projects df)
df = pd.read_csv('DATA/settlement_data.csv')

# Copy without the identifier column; this must run after the CSV is loaded,
# because the projects DataFrame has no 'Settlement CODE' column
df_2 = df.drop('Settlement CODE', axis=1)

# Get all columns specified in the query
categorical_columns = ['Agriculture Type', 'Water Source', 'Sanitation Facilities', 'Access to Markets', 'Climate', 'Soil Type', 'Natural Disasters', 'Land Ownership', 'Access to Financial Services', 'Social Services']

# Initialize LabelEncoder
le = LabelEncoder()

# Encode the categorical columns
for col in categorical_columns:
    df[col] = le.fit_transform(df[col])

# Print unique values for each encoded column
for col in categorical_columns:
    print(f"Unique values for {col}: {df[col].unique()}")

# Print the column names and their data types
# (df.info() prints its summary itself and returns None)
df.info()

In [None]:
# Encode the target column 'Income Level' as integers
le = LabelEncoder()
df['Income Level'] = le.fit_transform(df['Income Level'])

# Print the column names and their data types
df.info()

In [None]:
df.head()

In [None]:
df.to_csv('DATA/settlement_encoded.csv', index=False)

In [19]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Drop the `Settlement CODE` column; errors='ignore' keeps the cell re-runnable
# when the column has already been removed. Run this cell only after the
# settlement data has been loaded and encoded in the cells above.
df = df.drop(columns='Settlement CODE', errors='ignore')

# Split the data into features and target variable
X = df.drop('Income Level', axis=1)
y = df['Income Level']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train a Random Forest Classifier model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Print the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

In [None]:
# Split the data into features and target variable
X = df.drop('Income Level', axis=1)
y = df['Income Level']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train a Random Forest Classifier model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Print the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

In [None]:
df.corr(numeric_only=True)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the correlation matrix (numeric columns only)
correlation_matrix = df.corr(numeric_only=True)

# Plot the heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", cbar=True)
plt.title('Correlation Heatmap')
plt.show()

# Display the correlation matrix
print(correlation_matrix.round(2))

The correlation matrix reveals numerous perfect correlations (1 or -1), which can obscure other potentially significant relationships. To gain a clearer understanding, we'll filter the correlation matrix to display only correlations that are less than 1 (in absolute value) and exceed a certain threshold, indicating a substantial association between features. We'll set this threshold to 0.5 (absolute value).



In [None]:
# Keep only pairs whose absolute correlation is above 0.5 and below 1
# (the original filter excluded +1 but let perfect -1 correlations through)
abs_corr = correlation_matrix.abs().round(2)
significant_correlations = correlation_matrix.round(2)[(abs_corr > 0.5) & (abs_corr < 1)]

# Display the significant correlations
print("Significant Correlations:\n")
print(significant_correlations)

Here are some types of questions that can be answered with a relatively high level of confidence from this dataset:

#### Relationship between settlement characteristics

These questions explore how different aspects of the settlements, such as population, education, infrastructure, and access to services, relate to each other.

- Example: How does population size relate to the level of education in a settlement?
- Example: Is there a correlation between infrastructure development and access to financial services?

#### Comparison of settlements

You can compare different settlements based on their overall characteristics or specific aspects like education levels or access to services.

- Example: Which settlements have the highest levels of education and infrastructure development?
- Example: Which settlements have the poorest access to markets and financial services?

#### Impact of combined factors

These questions analyze how multiple factors work together to influence a particular outcome.

- Example: How do population size, education levels, and infrastructure development collectively influence income levels in a settlement?
- Example: Does the combination of access to markets and financial services have a significant impact on the overall well-being of a settlement?

#### Identification of trends or patterns

You can look for general trends or patterns in the data.

- Example: Do settlements with higher population densities tend to have better infrastructure?
- Example: Is there a general trend of increasing access to services with higher levels of education?

**Caution:** Due to the high degree of multicollinearity, it's crucial to be cautious when interpreting the results. The focus should be on understanding the combined influence of features or overall trends rather than isolating the effect of individual features.
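Multicollinearity of the kind described above can be quantified with the variance inflation factor (VIF): regress each feature on all the others and compute 1 / (1 - R²). The sketch below is a minimal illustration on synthetic columns (`x1`, `x2`, `x3` are made-up names, not settlement features), and the helper `variance_inflation_factors` is hypothetical, written here for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features: x2 is almost a rescaled copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
features = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.05, size=200),
    "x3": rng.normal(size=200),
})


def variance_inflation_factors(frame):
    """Return the VIF of each column, regressing it on all other columns."""
    vifs = {}
    for col in frame.columns:
        others = frame.drop(columns=col)
        r2 = LinearRegression().fit(others, frame[col]).score(others, frame[col])
        # A perfect fit means the column is an exact linear combination
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF")


vif = variance_inflation_factors(features)
print(vif.round(1))
```

A common rule of thumb treats VIF values above about 5 to 10 as a sign that a feature's individual effect cannot be separated reliably from the others.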

In [None]:
import matplotlib.pyplot as plt
df['Primary Education Proportion'] = df['Primary Education No of People'] / df['Population']
df['Secondary Education Proportion'] = df['Secondary Education No of People'] / df['Population']
df['Higher Education Proportion'] = df['Higher Education No of People'] / df['Population']

# 1. Create subplots sharing a y-axis
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)

# 2. First subplot: Population vs Primary Education Proportion
axes[0].scatter(df['Population'], df['Primary Education Proportion'])
axes[0].set_title('Population vs. Primary Education Proportion')
axes[0].set_xlabel('Population')
axes[0].set_ylabel('Education Proportion')

# 3. Second subplot: Population vs Secondary Education Proportion
axes[1].scatter(df['Population'], df['Secondary Education Proportion'])
axes[1].set_title('Population vs. Secondary Education Proportion')
axes[1].set_xlabel('Population')

# 4. Third subplot: Population vs Higher Education Proportion
axes[2].scatter(df['Population'], df['Higher Education Proportion'])
axes[2].set_title('Population vs. Higher Education Proportion')
axes[2].set_xlabel('Population')

# 5. Adjust layout and display plot
plt.tight_layout()
plt.show()

The scatter plots illustrate the relationship between population size and the proportion of each education level in a settlement.

#### Primary Education

There's a clear negative correlation between population size and the proportion of people with primary education. Larger settlements tend to have a lower proportion of residents with only primary education.

#### Secondary Education

The relationship between population size and the proportion of people with secondary education is less clear, with a slight positive trend. Larger settlements might have a slightly higher proportion of residents with secondary education.

#### Higher Education

There's a noticeable positive correlation between population size and the proportion of people with higher education. Larger settlements tend to have a higher proportion of residents with higher education.

These observations suggest that as the population size of a settlement increases, there's a shift towards higher levels of education. Larger settlements likely have better access to educational institutions and opportunities, leading to a higher proportion of residents attaining secondary and higher education. Conversely, smaller settlements might have limited educational resources, resulting in a larger proportion of residents having only primary education.

It's important to remember that these are just trends observed in the data, and individual settlements might deviate from these patterns. Additionally, the high degree of multicollinearity in the dataset warrants caution in interpreting these relationships, as other factors might be influencing both population size and education levels.
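Trends read off a scatter plot can be backed with a number. The sketch below computes Pearson's r and a rank-based (Spearman-style) correlation between population and an education proportion. The `demo` frame is synthetic and merely reuses two column names from the notebook, so its output illustrates the method, not the settlement data.

```python
import numpy as np
import pandas as pd

# Synthetic data: a proportion that rises with population, plus noise
rng = np.random.default_rng(1)
population = rng.integers(1_000, 100_000, size=50)
demo = pd.DataFrame({
    "Population": population,
    "Higher Education Proportion": (
        0.05 + population / 1_000_000 + rng.normal(scale=0.005, size=50)
    ),
})

# Pearson's r measures linear association
pearson_r = demo["Population"].corr(demo["Higher Education Proportion"])

# Correlating the ranks gives Spearman's rho, which captures any
# monotonic trend and is less sensitive to outliers
spearman_r = demo["Population"].rank().corr(
    demo["Higher Education Proportion"].rank()
)

print(f"Pearson r:  {pearson_r:.3f}")
print(f"Spearman r: {spearman_r:.3f}")
```

If the two values differ noticeably on real data, the relationship is probably monotonic but not linear, or a few extreme settlements are driving the linear estimate.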

In [None]:
df['Primary Education No of People'].sum()

In [None]:
df['Secondary Education No of People'].sum()

In [None]:

df['Higher Education No of People'].sum()