**Exploring Relationships between:**
# Poverty, Education and Murder
#### Mohammad Alaa Alghamry
email: mohammad.alaa.alghamry@gmail.com

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

In this report we build a dataset containing the yearly **GDP**, **Education Index** and **Murders** of the countries of the world from 1990 until 2016, out of the following datasets from the Gapminder data collection:

### Gapminder Datasets used:
**- Murders:**
   >*Total number of estimated deaths from interpersonal violence, of the world.*
   
**- GDP/capita:**
   >*GDP per capita, (Data are in constant 2010 U.S. dollars).*

**- OWID Education Index:**
   >*Education index calculated based on average years of schooling, taking values 0 as minimum and 15 as maximum.*

**- Population, total:**
   >*Total Population*

## **Final dataset column names:**
- country: *`country`*,
- year: *`year`*, 
- GDP: *`income`*,
- Education Index: *`educ_idx`*, 
- Murders: *`murders`*,
- Population: *`population`*,
- Murders Rate per million: *`mur_rate`*.

In [None]:
# Environment setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # the plotting interface lives in matplotlib.pyplot
import plotly.express as px
#%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling
##### Key points:
> - Remove data of unwanted periods (before 1990 and after 2016), as the chosen period has the most consistent data.
> - Reshape the datasets into a vertical (long) format, as they are laid out horizontally (one column per year), while keeping the 'country' column.
> - Merge all datasets while maintaining the correct mapping between countries and years.

#### Data loading

In [None]:
murders_src = pd.read_csv('../input/gapmindermurderseducationincome/murder_total_deaths.csv')
income_src = pd.read_csv('../input/gapmindermurderseducationincome/gdppercapita_us_inflation_adjusted.csv')
education_src = pd.read_csv('../input/gapmindermurderseducationincome/owid_education_idx.csv')
population_src = pd.read_csv('../input/population-data/population_total.csv')

### Data check

> Check murders dataset

In [None]:
murders_src.head(3)

> Check income dataset

In [None]:
# check income data
income_src.head(3)

> check education dataset

In [None]:
# check education data
education_src.head(3)

In [None]:
# check population data
population_src.head(3)

### Format and Clean the data
>- Before reshaping the tables we need to drop all the year columns before 1990 and after 2016, so that we keep only the data of our chosen period, which is the period with the most consistent data among the selected datasets.
>- The loaded Gapminder data is in a horizontal (wide) format, with one column per year, so we need to reshape it into a vertical (long) format.
>- We use the function **pd.melt()** to unpivot each dataset into the vertical format.
>- Save the unpivoted murders_src as murders
>- Save the unpivoted income_src as income
>- Save the unpivoted education_src as education
>- Save the unpivoted population_src as population
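
The reshaping step described above can be sketched on a tiny table. This is only an illustration: the frame `wide` and its values are made up, and only the column layout (one `country` column plus one column per year) is assumed to match the Gapminder files.

```python
import pandas as pd

# Hypothetical wide table in the Gapminder layout: one row per country,
# one column per year (the values are made up for illustration).
wide = pd.DataFrame({
    "country": ["A", "B"],
    "1990": [10.0, 20.0],
    "1991": [11.0, 21.0],
})

# Unpivot: every (country, year) pair becomes its own row.
long = wide.melt(id_vars="country", var_name="year", value_name="murders")
print(long)
```

Note that the year labels come from column names, so after the melt the `year` column holds strings. This is why later cells compare against `"1990"` rather than `1990`.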

## Data Cleaning

#### Murders Dataset

>Unpivot the 'Murders' dataset into the vertical (long) format

In [None]:
# we first get the column names of the smallest dataset we have 'Murders' before rotating the dataset
# we need them to select the same columns from the other datasets, to make the data consistent
sel_columns = murders_src.columns

# then we format and unpivot "make it vertical" the 'Murders' dataset
murders = murders_src.melt(id_vars='country', value_vars=sel_columns[1:], var_name='year', value_name='murders')

# check the operation
print(murders.head(3), "\t.....melt murders Done.....\n")

>check for Null values

In [None]:
# check for Null values
murders.isna().any()

>check data types

In [None]:
# check data types
murders.info()

#### We convert the `murders` column to type `int`, as we don't need murder counts as floats

In [None]:
# convert and check
murders.murders = murders.murders.astype('int')
murders.info()
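
One detail of the conversion is worth a small sketch. Casting floats with `astype('int')` drops the fractional part instead of rounding, so if the estimates contain fractional values, rounding first may be preferable. The values below are made up, and whether the Gapminder estimates actually contain fractions is an assumption.

```python
import pandas as pd

# Made-up estimates with fractional parts (illustration only).
estimates = pd.Series([2.7, 0.4, 15.0])

truncated = estimates.astype("int")        # drops the fractional part
rounded = estimates.round().astype("int")  # rounds to the nearest integer first

print(truncated.tolist(), rounded.tolist())
```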

#### GDP Dataset

>unpivot and format the same as 'murders' dataset

In [None]:
# we select only the columns we need from the 'GDP' dataset (the same years as 'murders'),
# then we unpivot and format it the same as the 'murders' dataset
income = income_src[sel_columns].melt(id_vars='country', var_name='year', value_name='income')
print(income.head(3), "\t.....melt income Done.....\n")

>check for Null values

In [None]:
# check for Null values
income.isna().any()

### income info

In [None]:
income.info()

>***As you can see, there are some null values in the `income` column of the `income` dataset, so we need to fix that.***

In [None]:
# create a function that fills the nulls of a column with the mean of the group each row belongs to
# because we may need it later for the other dataset 'education', which may also contain null values
def col_fillna_with_mean(df, key, column):
    """[Warning: this function modifies `df` in place!]
    Fill the null values of `column` with the mean of that column within each `key` group.
    Parameters: (df) dataframe, (key) column to group on, (column) column to clean.
    Groups with no observed values at all are left as null."""

    group_mean = df.groupby(key)[column].transform('mean')  # for every row, the mean of its own group
    df[column] = df[column].fillna(group_mean)  # only the null entries are replaced
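
Filling each country's gaps with that country's own mean can be sketched with `groupby(...).transform('mean')`. This is a minimal illustration on made-up data, not the notebook's actual output; the frame `toy` and the column `income_filled` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up data: country C has no observed income at all.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "C"],
    "income": [100.0, np.nan, 300.0, 10.0, np.nan, np.nan],
})

# transform('mean') returns, for every row, the mean of that row's group.
group_mean = toy.groupby("country")["income"].transform("mean")

# fillna only touches the missing entries, each with its own group's mean.
toy["income_filled"] = toy["income"].fillna(group_mean)
print(toy)
```

A country with no observations has no mean to fall back on, so its values stay null and need separate handling.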


In [None]:
# for each country's null income, replace with the mean income of that country
income['income'] = income['income'].fillna(income.groupby('country')['income'].transform('mean'))

In [None]:
# check for Null values
income.isna().any()

#### Education Dataset

In [None]:
# get the desired columns using the 'sel_columns' list,
# then unpivot and format the same as the 'murders' and 'income' datasets
education = education_src[sel_columns].melt(id_vars='country', var_name='year', value_name='educ_idx')
print(education.head(3), "\t.....melt education Done.....\n")

In [None]:
# check for Null
education.isna().any()

> Also here we got null values, so we need to call our function `col_fillna_with_mean()` to get the job done.

In [None]:
# call the function to apply the filling
col_fillna_with_mean(education, key='country', column='educ_idx')

In [None]:
# check after calling our function
education.isna().any()

#### Population Dataset

In [None]:
# get the desired columns using the 'sel_columns' list,
# then unpivot and format the same as the other datasets
population = population_src[sel_columns].melt(id_vars='country', var_name='year', value_name='population')
print(population.head(3), "\t.....melt population Done.....\n")

In [None]:
# check for Null values
population.isna().any()

In [None]:
population[population.country == "Venezuela"].head(3).style.format({'population': "{:,.2f}"})

# Start building the Main dataset as `df`
>*After building the dataset, the `income` and `educ_idx` columns are going to have Null values for certain countries*.
- income will have Null values for these countries: **North Korea**, **Somalia**, **Syria**
- educ_idx will have Null values for these countries: **Micronesia, Fed. Sts.**, **North Korea**, **Timor-Leste**

>***Note:***<br>
>*I am not dropping those countries, as I need their murders data for comparisons*

#### Initialize `df`

In [None]:
# we start with assigning the 'murders' dataset to 'df'
df = murders
print(df.head(3), "\t.....initialize df from murders Done.....\n")

#### Merge `income` dataset with `df`

In [None]:
df = pd.merge(left=df, right=income, how='left', on=['country', 'year'])
print(df.head(3), "\t.....merge (df and income) Done.....\n")
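
A left merge keeps every row of the left table and leaves nulls where the right table has no matching `(country, year)` pair. As a hedged sketch on made-up frames, the `indicator=True` option of `pd.merge` adds a `_merge` column that shows which rows found no partner.

```python
import pandas as pd

# Made-up frames: country B is missing from the right-hand table.
left = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": ["1990", "1991", "1990"],
    "murders": [5, 6, 7],
})
right = pd.DataFrame({
    "country": ["A", "A"],
    "year": ["1990", "1991"],
    "income": [100.0, 110.0],
})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge.
merged = pd.merge(left, right, how="left", on=["country", "year"], indicator=True)

# Countries whose rows found no match in the right-hand table.
unmatched = merged.loc[merged["_merge"] == "left_only", "country"].unique().tolist()
print(unmatched)
```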

> **- Get the countries that have `Null income` values**

In [None]:
list(df[df.income.isna()].country.unique())

#### Merge `education` dataset with `df`

In [None]:
df = pd.merge(left=df, right=education, how='left', on=['country', 'year'])
print(df.head(3), "\t.....merge (df and education) Done.....\n")

> **- Get the countries that have `Null educ_idx` values**

In [None]:
list(df[df.educ_idx.isna()].country.unique())

#### Merge `population` dataset with `df`

In [None]:
df = pd.merge(left=df, right=population, how='left', on=['country', 'year'])
print(df.head(3), "\n\n\t.....merge (df and population) Done.....\n")

#### Create the murder rate column `mur_rate`
>Calculate the murder rate per 1 million inhabitants: (murders / population) * 1M

In [None]:
df['mur_rate'] = (df.murders/df.population) * 1000000
print(df.head(3), "\n\n\t..... mur_rate column added .....\n")
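
The rate formula can be checked by hand on a few made-up rows. The sketch below also shows that a missing population propagates to a missing rate; the numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; country C has no population figure.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "murders": [50, 500, 5],
    "population": [1_000_000, 50_000_000, np.nan],
})

# murders per 1 million inhabitants
toy["mur_rate"] = toy["murders"] / toy["population"] * 1_000_000
print(toy)
```

Country A has 50 murders in a population of 1 million, so its rate is about 50. Country B has ten times the murders but fifty times the population, so its rate is about 10. Country C has a null rate.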

#### Fill the remaining missing values in `income` and `educ_idx` with zeros, not with means

In [None]:
df.fillna(0, inplace=True)

#### check the final dataset for Null values

In [None]:
df.isna().any()

# Gathering Insights
### Let's ask some questions

#### A quick summary

In [None]:
df.describe().style.format({'population': "{:,.2f}"})

### Which country has the highest number of murders in a single year during 1990-2016, and a little info about the country

In [None]:
df[df['murders'] > 60000].style.format({'population': '{:,.2f}'})

### The country that has the highest total murders through the whole period (1990-2016), and how big the number is

In [None]:
# The country that has the highest total number of murders through the whole period (1990-2016)
print("The country that has the highest murder total during the whole period:")
h_total_mur = df.groupby('country')[['murders']].sum()  # sum only the murders column per country
h_total_mur = h_total_mur[h_total_mur.murders == h_total_mur.murders.max()]
h_total_mur = h_total_mur.reset_index()
h_total_mur = h_total_mur.style.format({'murders': "{:,.2f}"}) # thousands separators make the number easier to read
h_total_mur

### Which country has the highest murder rate per 1 million

In [None]:
df[df['mur_rate'] > 773].style.format({'population': '{:,.2f}'})

>*As you can see, querying the murders data by murder rate makes a huge difference, as it eliminates the factor of population size.*
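
Why the two rankings can disagree is easy to see on two made-up countries. The names `Large` and `Small` and all numbers below are hypothetical.

```python
import pandas as pd

# Two made-up countries: one large with many murders, one small with few.
toy = pd.DataFrame({
    "country": ["Large", "Small"],
    "murders": [10_000, 500],
    "population": [200_000_000, 1_000_000],
})
toy["mur_rate"] = toy["murders"] / toy["population"] * 1_000_000

# Rank once by raw count and once by rate per million.
top_by_count = toy.sort_values("murders", ascending=False)["country"].iloc[0]
top_by_rate = toy.sort_values("mur_rate", ascending=False)["country"].iloc[0]
print(top_by_count, top_by_rate)
```

The large country leads by count, while the small one leads by rate, because its 500 murders fall on a population 200 times smaller.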

## Let's do some Visual Analysis

### But first let's see the whole thing

In [None]:
fig = px.choropleth(df, locations = 'country', locationmode = 'country names', animation_frame='year',
                    hover_data=['income', 'educ_idx', 'murders'],
                    color='murders', 
                    #color_continuous_scale=['rgb(70,77,70)', 'rgb(200,59,59)', 'rgb(255,33,19, 78)'],
                    color_continuous_scale=['rgb(30,90,90)', 'rgb(200,59,59)', 'rgb(255,33,19)'],
                    height = 750,
                    title = """Murders of The World - plotting count no regard to rate for country population - year slider below""")
fig.show()

> ***Use the slider to change the year, and hover over any country to get more information.***

In [None]:
fig = px.choropleth(df, locations = 'country', locationmode = 'country names', animation_frame='year',
                    hover_data=['income', 'educ_idx', 'murders'],
                    color='mur_rate', 
                    color_continuous_scale=['rgb(30,90,90)', 'rgb(200,59,59)', 'rgb(255,33,19)'],
                    #color_continuous_scale=['rgb(30,30,30)', 'rgb(230,100,100)', 'rgb(255,33,19, 78)'],
                    height = 750,
                    title = """Murders\' Rate of The World - plotting murder rate per 1 million per country - year slider below""")
fig.show()

> ***Use the slider to change the year, and hover over any country to get more information.***

### If we plot the total murder count per country, for the 50 countries with the highest counts, we get this:

In [None]:
data = df.groupby('country').sum(numeric_only=True).sort_values(by='murders', ascending=False).head(50)
#text = f"""\n\n\nOnly data for the "50" highest countries in murder is shown in this plot\n
#     - Max total Murders: {data.murders.iloc[0]:,}\n
#     - Min total Murders: {data.murders.iloc[-1]:,}"""

# data is sorted in descending order, so the first row holds the maximum and the last row the minimum
pie = data.plot(y='murders', kind='bar', figsize = (15,5), 
                title=f"""\n\n\nOnly data for the "50" highest countries in murder is shown in this plot\n
     - Max total Murders: {data.murders.iloc[0]:,}\n
     - Min total Murders: {data.murders.iloc[-1]:,}""")

pie.legend(loc='upper right', bbox_to_anchor=(1.15,1));
#pie.text(1, 1000, text, fontsize=13, color="blue");

### But if we plot the murder rate per country, we get the following:
>Note: this plot is sorted by murder rate (`mur_rate`), not by murder count (`murders`), so the set and order of countries can differ from the previous plot.

##### I aggregated the `mur_rate` values for each country by `mean()`

In [None]:
data = df.groupby('country').mean(numeric_only=True).sort_values(by='mur_rate', ascending=False).head(50)

#text = f"""\n\n\nOnly data for the "50" highest countries in murder is shown in this plot\n
#     - Max murder_rate mean: {data.mur_rate.max():,.2f}\n
#     - Min murder_rate mean: {data.mur_rate.min():,.2f}"""

pie = data.plot(y='mur_rate', kind='bar', figsize = (15,5),
               title=f"""\n\n\nOnly data for the "50" highest countries in murder is shown in this plot\n
      - Max murder_rate mean: {data.mur_rate.max():,.2f}\n
      - Min murder_rate mean: {data.mur_rate.min():,.2f}""")
pie.legend(loc='upper right', bbox_to_anchor=(1.15,1));
#pie.text(-1.5, 1, text, fontsize=13, color="blue");

### Let's plot the murder rate per country, combining the `highest 25` and `lowest 25` countries:

In [None]:
data = df.groupby('country').mean(numeric_only=True).sort_values(by='mur_rate', ascending=False)
hi_data = data.head(25)
lo_data = data.tail(25)
data = pd.concat([hi_data, lo_data])  # DataFrame.append was removed in pandas 2.0

#text = f"""\n\n\nOnly data for the "25" highest murder rate countries, and "25" lowest murder rate countries, is shown in this plot\n
#     - Max murder_rate mean: {data.mur_rate.max():,.2f}\n
#     - Min murder_rate mean: {data.mur_rate.min():,.2f}"""

pie = data.plot(y='mur_rate', kind='bar', figsize = (15,5), 
                title=f"""\n\n\nOnly data for the "25" highest murder rate countries, and "25" lowest murder rate countries, is shown in this plot\n
     - Max murder_rate mean: {data.mur_rate.max():,.2f}\n
     - Min murder_rate mean: {data.mur_rate.min():,.2f}""")
pie.legend(loc='upper right', bbox_to_anchor=(1.15,1));
#pie.text(-1.5, 1, text, fontsize=13, color="blue");

>***Interesting finding:***
- Most of the Arab Muslim countries are among the countries with the lowest murder rates.

> We have also discovered a country with a ***mean murder rate of zero***, which means that, according to this dataset, the country had no murders ***at all during the 27 years***.
### Principality of Andorra

In [None]:
df.query('mur_rate == 0').groupby('country').agg({'murders':'sum', 'income':'max', 'educ_idx':'max', 'mur_rate':'mean', 'population':'max'}).style.format({'population': '{:,}'})

>Let's compare it to countries with a murder rate below 5

In [None]:
agg_dict = {'murders':'sum', 'income':'max', 'educ_idx':'max', 'mur_rate':'mean', 'population':'max'}
df.query('mur_rate < 5').groupby('country').agg(agg_dict).sort_values('mur_rate').style.format({'population': '{:,}'})

# Let's discover some relations

### What is the relation between `Murder rate`, `Income` and `Education`, globally combined

In [None]:
data = df.groupby('year').agg({'educ_idx': 'mean', 'mur_rate': 'mean', 'income': 'mean'})
# constant scale factors bring the three series onto a comparable scale for plotting
data.mur_rate = data.mur_rate * 130
data.educ_idx = data.educ_idx * 24000
#data.mur_rate = data.mur_rate * 1000
plot = data.plot(kind='bar', figsize=(15,5), title="Relation between murder, income and education, Globally combined", alpha=0.9)
plot.set_ylabel('Normalized Values Scale');

> ***We can clearly see a positive relation between income and education, and a roughly negative relation between murders and the other two.***
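
As an alternative to constant scale factors, the three series could be put on a common 0 to 1 scale with min-max normalization. This is a sketch on made-up yearly means, not the notebook's method; the frame `yearly` is hypothetical.

```python
import pandas as pd

# Made-up yearly means on very different scales.
yearly = pd.DataFrame(
    {
        "educ_idx": [0.40, 0.45, 0.50],
        "mur_rate": [60.0, 55.0, 50.0],
        "income": [8000.0, 9000.0, 10000.0],
    },
    index=["1990", "1991", "1992"],
)

# Min-max normalization: each column is rescaled to the range [0, 1].
normalized = (yearly - yearly.min()) / (yearly.max() - yearly.min())
print(normalized)
```

Each column's smallest value maps to 0 and its largest to 1, so the direction of each trend stays visible without choosing factors by hand.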

### What is the relation between `Income` and `Murder Rate`, for years: `1990`, `2000`, `2016`

In [None]:
data = df.query('year == "1990"')
plot = data.plot(x='income', y='mur_rate', kind='scatter', figsize=(15,5), color='cyan', alpha=1, legend=True)

data = df.query('year == "2000"')
data.plot(x='income', y='mur_rate', kind='scatter', figsize=(15,5), ax=plot, color='blue', alpha=1)

data = df.query('year == "2016"')
data.plot(x='income', y='mur_rate', kind='scatter', figsize=(15,5), ax=plot, color='magenta', alpha=1)

plot.set_ylabel('Murder Rate per Year')
plot.set_xlabel('Average Income per Capita per Year')
plot.set_title("Relation between income and murders, for years: `1990`, `2000`, `2016`");

- There seems to be a negative relation between murder rate (`mur_rate`) and average income (`income`)

### What is the relation between `Education` and `Murder Rate`, for years: `1990`, `2000`, `2016`

In [None]:
data = df.query('year == "1990"')
plot = data.plot(x='educ_idx', y='mur_rate', kind='scatter', figsize=(15,5), color='cyan')

data = df.query('year == "2000"')
data.plot(x='educ_idx', y='mur_rate', kind='scatter', figsize=(15,5), ax=plot, color='blue')

data = df.query('year == "2016"')
data.plot(x='educ_idx', y='mur_rate', kind='scatter', figsize=(15,5), ax=plot, color='magenta')

plot.set_ylabel('Murder Rate per Year')
plot.set_xlabel('OWID Education index per Year')
plot.set_title("Relation between education and murders, for years: `1990`, `2000`, `2016`");

- Doesn't look like there is an obvious relation at all

### What is the relation between `Income` and `Murder Rate` across the whole period

In [None]:
plot = df.plot(x='income', y='mur_rate', kind='scatter', figsize=(15,5), title="Relation between income and murders")
plot.set_ylabel("Murder Rate per Year per Country")
plot.set_xlabel("Average Income per Year per Country");

- The negative relation appears again, and more prominently, across the whole period

### What is the relation between `Education` and `Murder Rate` across the whole period

In [None]:
plot = df.plot(x='educ_idx', y='mur_rate', kind='scatter', figsize=(15,5), title="Relation between education and murders")
plot.set_ylabel("Murder Rate per Year per Country")
plot.set_xlabel("OWID Education Index per Year per Country");

- Like before no apparent relation.
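
The visual impression from a scatter plot can be backed by a number. As a hedged sketch, `Series.corr` computes the Pearson correlation coefficient, shown here on made-up values rather than on the notebook's data.

```python
import pandas as pd

# Made-up values in which the murder rate falls as income rises.
toy = pd.DataFrame({
    "income": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "mur_rate": [50.0, 45.0, 30.0, 28.0, 10.0],
})

# Pearson correlation: close to -1 is a strong negative linear relation,
# close to 0 is no linear relation, close to +1 is a strong positive one.
r = toy["income"].corr(toy["mur_rate"])
print(round(r, 3))
```

Applied to `df`, the same call on `income` or `educ_idx` against `mur_rate` would give one coefficient per pair. Rows whose missing values were filled with zeros would influence the result.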

# So now let's see

>What do we do if we want to plot the murders of the 10 highest countries on a timeline?

#### Maybe we get the names of the 10 highest countries in each year

>But let's define a useful function first

In [None]:
def top_10_murdr_countries(df, n_years=None, given_years=None, auto=True, rate=True):
    """Counts how often each country appears in the yearly top 10.

    With `auto=True`, covers the first `n_years` years starting at 1990
    (minimum 1, maximum 27). With `auto=False`, covers each year in the
    `given_years` list (strings, as in the `year` column).
    Ranks by `mur_rate` if `rate` is True, otherwise by `murders`.
    Returns a dict mapping each country to its number of appearances."""

    # validate the input and build the list of years to look at
    if auto:
        if n_years is None:
            raise ValueError("n_years must be passed when auto=True")
        if n_years < 1 or n_years > 27:
            raise ValueError("no values below 1 nor above 27 are allowed")
        years = [str(1990 + i) for i in range(n_years)]
    else:
        if given_years is None:
            raise ValueError("given_years must be passed when auto=False")
        years = given_years

    sort_col = 'mur_rate' if rate else 'murders'

    # collect the top 10 country names of every year
    top_10_countries_each_year = []
    for year in years:
        top_10_murd_year = df[df['year'] == year].sort_values(sort_col, ascending=False).head(10)
        top_10_countries_each_year.extend(top_10_murd_year['country'])

    # count in how many years each country made it into the top 10
    unique, counts = np.unique(top_10_countries_each_year, return_counts=True)

    return dict(zip(unique, counts))


>So our function returns the countries that made it into the yearly top 10 in murders, with a number for each one (we will discover what it means in a bit, inshaa Allah).
>We want to plot these top countries on a timeline during the selected period, so let's make the plot with them.
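
The same kind of tally (how many years each country spends in the yearly top N) can also be sketched with `groupby(...).head(n)` and `value_counts()`. This is an alternative illustration on made-up data with a top 2 instead of a top 10; `toy` and `top_n` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Made-up data: three countries over two years.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": ["1990", "1990", "1990", "1991", "1991", "1991"],
    "mur_rate": [30.0, 20.0, 10.0, 5.0, 25.0, 15.0],
})

top_n = 2  # hypothetical; the notebook uses 10

# Sort once by rate, then keep the first top_n rows of every year.
top_rows = (
    toy.sort_values("mur_rate", ascending=False)
       .groupby("year")
       .head(top_n)
)

# Number of years in which each country appears in the yearly top_n.
appearances = top_rows["country"].value_counts()
print(appearances)
```

Country B is in the top 2 in both years, while A and C each appear once.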

### Then let's get the highest countries in murders and in murder rate across the whole period

In [None]:
mur_dict = top_10_murdr_countries(df, n_years=27, auto=True, rate=True)
cps = pd.Series(mur_dict).sort_values()

r = cps.plot(kind="bar", figsize=(15,3))
r.set_ylabel('Count')
r.set_xlabel('Country')
r.set_title("A plot of occurrence of each top country, at the top 10 category of murder rate");

In [None]:
mur_dict = top_10_murdr_countries(df, n_years=27, auto=True, rate=False)
cps = pd.Series(mur_dict).sort_values()

r = cps.plot(kind="bar", figsize=(15,3))
r.set_ylabel('Count')
r.set_xlabel('Country')         
r.set_title("A plot of occurrence of each top country, at the top 10 category of murder count");

### Now we can plot them on a timeline, and we can see:
*`First line plot:`*
- The change in `murders` over time for each of ***the 10 highest countries in `murder count`***

*`Second line plot:`*
- The change in murder rate (`mur_rate`) over time for the countries that ***ever got into the top 10 countries in `murder rate` throughout the whole period***

### Murder count per year from 1990 until 2016 for the 10 highest countries in murder count

In [None]:
mur_dict = top_10_murdr_countries(df, n_years=27, auto=True, rate=False)
mur_ser = pd.Series(mur_dict).sort_values(ascending=False)
data = df.query('country in @mur_ser.index')

fig = px.line(data, x = 'year', y = 'murders', color = 'country',
              hover_data=['population', 'mur_rate'],
              title = 'Murder count per year from 1990 until 2016 for the 10 highest countries in murder count')
fig.show()

### Murder Rate per year from 1990 until 2016 of the countries that ever got into the top 10 countries in `murder rate` throughout the whole period

In [None]:
mur_dict = top_10_murdr_countries(df, n_years=27, auto=True, rate=True)
mur_ser_r = pd.Series(mur_dict).sort_values(ascending=False)
data = df.query('country in @mur_ser_r.index')

fig = px.line(data, x = 'year', y = 'mur_rate', color = 'country',
              hover_data = ['murders', 'population'],
              title="""Murder Rate per year of the countries that ever got into the top 10 countries in `murder rate` whole period""")
fig.show()

- *As you can see, we can now easily compare the change in `Murder Count` and `Murder Rate` between countries and throughout the whole period.*

In [None]:
# Plot mean income and (scaled) mean murder rate for the countries with the highest murder counts
data = df.groupby('country').agg({'mur_rate': 'mean', 'income': 'mean'}).sort_values('income').query('country in @mur_ser.index')
#data.mur_rate = data.murders/2
data.mur_rate = data.mur_rate * 30  # scale the rate so it is visible next to income
plot = data.plot(kind='bar', figsize=(15,3), title="Relation between income and rate of murder in highest countries in total murder")

# Plot the same for the countries with the highest murder rates
data = df.groupby('country').agg({'mur_rate': 'mean', 'income': 'mean'}).sort_values('income').query('country in @mur_ser_r.index')
#data.murders = data.murders/2
data.mur_rate = data.mur_rate * 30  # same scale factor as above
plot = data.plot(kind='bar', figsize=(15,3), title="Relation between income and rate of murder in highest murder rate countries");

<a id='conclusions'></a>
# Conclusions

## Results :
- **Poverty and Murder**: 
    - After examining these plots, we can assume that there is a positive relation between poverty and murder.
- **Education and Murder**: 
    - However, we cannot assume that there is a direct relation between education and murder.
- **Interesting findings:**
    - Most of the Arab Muslim countries are among the countries with the lowest murder rates.
    - ***Principality of Andorra***: a state with a ***mean murder rate of zero***, which means that, according to this dataset, it had no murders ***at all*** during the ***27 years***.
    - ***Egypt***: is among the 5 countries with the lowest murder rates throughout the whole period.
    

## Limitations :
- ***The current dataset doesn't allow us to get further, more accurate insights, as there are other indicators that could be affecting the murder rate all over the world, for example:***
    - Government types
    - Effect of religions
    - Effect of secularity
- ***Also, the limited range of the murders dataset prevents us from getting a broader view for our investigation.***

# Extras
### Use your mouse cursor to rotate the globe and hover over any country to get more information

In [None]:
fig = px.choropleth(df, locations = 'country', locationmode = 'country names', animation_frame='year',
                    projection = 'orthographic', hover_data=['income', 'educ_idx', 'murders'],
                    color='murders', height = 750,
                    color_continuous_scale=['rgb(70,77,70)', 'rgb(130,130,230)', 'rgb(170,59,59)', 'rgb(255,0,30)'],
                    #color_continuous_scale=['rgb(70,77,70)', 'rgb(130,130,170)', 'rgb(200,150,150)', 'rgb(255,0,30)'],
                    title = """Murders of The World - plotting count no regard to rate for country population - year slider below""")
fig.show()

In [None]:
fig = px.choropleth(df, locations = 'country', locationmode = 'country names', animation_frame='year',
                    projection = 'orthographic', hover_data=['income', 'educ_idx', 'murders'],
                    color='mur_rate', height = 750,
                    #color_continuous_scale=['rgb(70,77,70)', 'rgb(130,130,230)', 'rgb(200,59,59)', 'rgb(255,0,30)'],
                    color_continuous_scale=['rgb(70,77,70)', 'rgb(130,130,230)', 'rgb(170,59,59)', 'rgb(255,0,30)'],
                    #color_continuous_scale=['rgb(70,77,70)', 'rgb(130,130,170)', 'rgb(200,150,150)', 'rgb(255,0,30)'],
                    title = """Murders\' Rate of The World - plotting murder rate per 1 million per country - year slider below""")
fig.show()