### Working with missing values

Real-world DataFrames are frequently incomplete. Entirely empty rows are rare, but it is common for part of an observation to be missing.<br>
How should we deal with this situation? Let's have a look at the following DataFrame.

In [18]:
import pandas as pd
import numpy as np
import os
from pathlib import Path
data_dir = os.path.join(Path(os.getcwd()).parents[2], "data", "ntbk_data", "04_data")
data = {'col1': [1,2,3,None,5], 'col2' : ['a',np.nan,'c','d','b'], 'col3' : [10,20,30,np.nan,np.nan], 
        'col4' : [np.nan,np.nan,np.nan,np.nan,0.6], 'col5' : [np.nan,np.nan,np.nan,np.nan,np.nan], 
        'col6' : ['a','b','c','d','e'], 'col7' : [10,25,40,np.nan,15]}

df = pd.DataFrame.from_dict(data)

In [19]:
df

Unnamed: 0,col1,col2,col3,col4,col5,col6,col7
0,1.0,a,10.0,,,a,10.0
1,2.0,,20.0,,,b,25.0
2,3.0,c,30.0,,,c,40.0
3,,d,,,,d,
4,5.0,b,,0.6,,e,15.0


In [20]:
df.isnull()

Unnamed: 0,col1,col2,col3,col4,col5,col6,col7
0,False,False,False,True,True,False,False
1,False,True,False,True,True,False,False
2,False,False,False,True,True,False,False
3,True,False,True,True,True,False,True
4,False,False,True,False,True,False,False


In [21]:
df.col1.isnull()  # returns a boolean Series

0    False
1    False
2    False
3     True
4    False
Name: col1, dtype: bool

As you can see, we entered some missing values on purpose. In the numeric column __col1__, the Python __None__ value is converted to __np.nan__, and pandas treats both kinds of entry as missing. The first thing you should do is check the DataFrame for these values so that you have an overview of the completeness of your data.
You can also use the __.describe()__ method and compare the counts, but checking each column with __isnull()__ is more thorough.

In [22]:
#combining isnull() with sum() results in a nice overview
df.isnull().sum()

col1    1
col2    1
col3    2
col4    4
col5    5
col6    0
col7    1
dtype: int64
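
The counts above are easier to judge relative to the number of rows. Here is a minimal sketch, assuming a reduced copy of the toy data (rebuilt as `demo` so the block runs on its own; `missing_share` and `empty_cols` are names chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Rebuild a small frame with the same missing-value pattern as above
demo = pd.DataFrame({
    'col1': [1, 2, 3, None, 5],
    'col3': [10, 20, 30, np.nan, np.nan],
    'col5': [np.nan] * 5,
    'col6': ['a', 'b', 'c', 'd', 'e'],
})

# isna() is an alias of isnull(); the mean of a boolean column
# is the share of True values, i.e. the share of missing entries
missing_share = demo.isna().mean()
print(missing_share)

# Columns that contain no values at all
empty_cols = missing_share[missing_share == 1.0].index.tolist()
print(empty_cols)
```

A column with a share of 1.0 is completely empty, which makes it an obvious candidate for removal.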

### Removing data from a DataFrame


What should we do if data is missing in a DataFrame? One option is to remove observations (rows) or attributes (columns) completely. This is acceptable only if you are sure that it does not remove information you need!<br>
Have a look at our DataFrame: __col5__ contains no values at all and therefore carries no information, so let's remove it!

In [23]:
df.drop('col5', axis = 1, inplace=True)
# an entirely empty column is worth investigating: check where the data came from or whether there was an import error!

In [24]:
df

Unnamed: 0,col1,col2,col3,col4,col6,col7
0,1.0,a,10.0,,a,10.0
1,2.0,,20.0,,b,25.0
2,3.0,c,30.0,,c,40.0
3,,d,,,d,
4,5.0,b,,0.6,e,15.0


One may argue whether __col4__ contains much information either, but let's leave it for the time being.

In [25]:
#dropna(axis=1) removes every column that contains at least one missing value;
#without the axis argument (default axis=0), rows are removed instead
df1 = df.dropna(axis = 1)
df1

Unnamed: 0,col6
0,a
1,b
2,c
3,d
4,e
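
Dropping every column with a single gap is often too strict. The sketch below shows three documented `dropna()` parameters (`how`, `thresh`, `subset`) that give finer control. It assumes a reduced copy of the toy data, and the names `demo`, `no_empty_cols`, `dense_cols` and `rows_with_col1` are chosen only for illustration:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'col1': [1, 2, 3, np.nan, 5],
    'col4': [np.nan, np.nan, np.nan, np.nan, 0.6],
    'col5': [np.nan] * 5,
    'col6': ['a', 'b', 'c', 'd', 'e'],
})

# Drop only columns in which every value is missing (here: col5)
no_empty_cols = demo.dropna(axis=1, how='all')

# Keep only columns with at least 3 non-missing values (drops col4 and col5)
dense_cols = demo.dropna(axis=1, thresh=3)

# Drop rows that lack a value in col1, ignoring gaps in other columns
rows_with_col1 = demo.dropna(subset=['col1'])

print(list(no_empty_cols.columns))
print(list(dense_cols.columns))
print(len(rows_with_col1))
```

Which variant is appropriate depends on how much missing data you are willing to tolerate per row or column.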


### Imputing data

We can try to fill in the missing data using the __fillna()__ method. This is most natural for numerical data, where a plausible replacement value can be computed or estimated. Let's try this for a single column.

In [26]:
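#assign the result back to the column; calling fillna(inplace=True) on a selected column
#is chained assignment, which recent pandas versions warn about and which may leave df unchanged
df['col1'] = df['col1'].fillna(4.0)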
df

Unnamed: 0,col1,col2,col3,col4,col6,col7
0,1.0,a,10.0,,a,10.0
1,2.0,,20.0,,b,25.0
2,3.0,c,30.0,,c,40.0
3,4.0,d,,,d,
4,5.0,b,,0.6,e,15.0


We don't always want to guess the correct value, and inspecting each entry individually is not feasible if the DataFrame is large. A common approach is to use the mean value of the column. Whether or not this makes sense depends on your knowledge of the data!

In [27]:
print(df.col7.mean())

22.5


In [28]:
#assign the result back instead of using inplace=True on a selected column
df['col7'] = df['col7'].fillna(df['col7'].mean())
df

Unnamed: 0,col1,col2,col3,col4,col6,col7
0,1.0,a,10.0,,a,10.0
1,2.0,,20.0,,b,25.0
2,3.0,c,30.0,,c,40.0
3,4.0,d,,,d,22.5
4,5.0,b,,0.6,e,15.0
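
Several columns can be filled in one call by passing a dictionary to `fillna()`. This is only a sketch under the assumption that the median, the mean and a placeholder label are sensible choices for these columns; `demo`, `fill_values` and `filled` are illustrative names:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'col2': ['a', np.nan, 'c', 'd', 'b'],
    'col3': [10, 20, 30, np.nan, np.nan],
    'col7': [10, 25, 40, np.nan, 15],
})

# One fill value per column: median for col3, mean for col7,
# and a placeholder label for the text column col2
fill_values = {
    'col3': demo['col3'].median(),
    'col7': demo['col7'].mean(),
    'col2': 'unknown',
}

# fillna() returns a new DataFrame; demo itself is not modified
filled = demo.fillna(fill_values)
print(filled)
```

Columns that are not listed in the dictionary are left untouched.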


In [29]:
df.describe()

Unnamed: 0,col1,col3,col4,col7
count,5.0,3.0,1.0,5.0
mean,3.0,20.0,0.6,22.5
std,1.581139,10.0,,11.456439
min,1.0,10.0,0.6,10.0
25%,2.0,15.0,0.6,15.0
50%,3.0,20.0,0.6,22.5
75%,4.0,25.0,0.6,25.0
max,5.0,30.0,0.6,40.0


Using __describe()__ on the DataFrame gives a summary of its numerical columns:
- count: the number of non-null elements in the column
- mean: the mean value of the column
- std: the standard deviation
- min/max: the minimum and maximum value respectively
- 25%/50%/75%: the respective percentiles (50% is the median)

In [30]:
df.col2.describe()  # applied to non-numerical data

count     4
unique    4
top       a
freq      1
Name: col2, dtype: object

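Applied to a non-numerical column, __describe()__ returns:
- count: the number of non-null values
- unique: the number of unique values
- top: the most frequent value (if several values are tied, as here, one of them is shown)
- freq: the frequency of the top value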

In [31]:
df.col7.describe()

count     5.000000
mean     22.500000
std      11.456439
min      10.000000
25%      15.000000
50%      22.500000
75%      25.000000
max      40.000000
Name: col7, dtype: float64

### Correlation in data

Here is a link to some [housing data](https://www.kaggle.com/schirmerchad/bostonhoustingmlnd) for the Boston suburban area. The column __RM__ gives the average number of rooms per dwelling, and the __MEDV__ column gives the median value of the homes.

In [32]:
dfh = pd.read_csv(os.path.join(data_dir,'housing.csv'), encoding= 'utf-8')
dfh.head()

Unnamed: 0,RM,LSTAT,PTRATIO,MEDV
0,6.575,4.98,15.3,504000.0
1,6.421,9.14,17.8,453600.0
2,7.185,4.03,17.8,728700.0
3,6.998,2.94,18.7,701400.0
4,7.147,5.33,18.7,760200.0


Using the __corr()__ method we can quickly compute pairwise correlation coefficients (Pearson by default) for all numerical columns. There is a clear positive correlation (about 0.70) between the number of rooms and the value of a home, as we expected!

In [33]:
dfh.corr()

Unnamed: 0,RM,LSTAT,PTRATIO,MEDV
RM,1.0,-0.612033,-0.304559,0.697209
LSTAT,-0.612033,1.0,0.360445,-0.76067
PTRATIO,-0.304559,0.360445,1.0,-0.519034
MEDV,0.697209,-0.76067,-0.519034,1.0
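
The `method` parameter of `corr()` selects the type of coefficient. The sketch below uses only the five rows printed by `dfh.head()` above, which is far too few for a meaningful estimate, so the numbers will differ from the table. The names `sample`, `pearson_rm_medv`, `spearman_rm_medv` and `medv_corr` are chosen for illustration:

```python
import pandas as pd

# The five rows shown by dfh.head(); used here only to keep the block self-contained
sample = pd.DataFrame({
    'RM': [6.575, 6.421, 7.185, 6.998, 7.147],
    'LSTAT': [4.98, 9.14, 4.03, 2.94, 5.33],
    'PTRATIO': [15.3, 17.8, 17.8, 18.7, 18.7],
    'MEDV': [504000.0, 453600.0, 728700.0, 701400.0, 760200.0],
})

# Pearson correlation (the default) between two columns
pearson_rm_medv = sample['RM'].corr(sample['MEDV'])

# Spearman rank correlation, which only assumes a monotonic relationship
spearman_rm_medv = sample.corr(method='spearman').loc['RM', 'MEDV']

# Correlation of every other column with MEDV, sorted in ascending order
medv_corr = sample.corr()['MEDV'].drop('MEDV').sort_values()

print(pearson_rm_medv, spearman_rm_medv)
print(medv_corr)
```

Keep in mind that a correlation coefficient describes association only; it does not show that more rooms cause a higher value.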


### Working with data


Let's work a bit with a concrete data set and apply our knowledge about data engineering / data wrangling to answer some concrete questions! Here is a [link](https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016) to a data set describing suicide rates in various countries. We import the data into a DataFrame and take an initial glimpse at it.

In [34]:
dfS = pd.read_csv(os.path.join(data_dir, 'suicide_data.csv'))

In [35]:
#rename some column names
dfS.rename(columns = {' gdp_for_year ($) ':'gdp_year', 'gdp_per_capita ($)':'gdp_cap'}, inplace = True)

In [36]:
#if you are unsure about the meaning of the specified data, look it up under the link!
dfS.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_year,gdp_cap,generation
0,Albania,1987,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,Albania,1987,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,Albania,1987,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,Albania,1987,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,Albania,1987,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


In [37]:
dfS.describe()

Unnamed: 0,year,suicides_no,population,suicides/100k pop,HDI for year,gdp_cap
count,27820.0,27820.0,27820.0,27820.0,8364.0,27820.0
mean,2001.258375,242.574407,1844794.0,12.816097,0.776601,16866.464414
std,8.469055,902.047917,3911779.0,18.961511,0.093367,18887.576472
min,1985.0,0.0,278.0,0.0,0.483,251.0
25%,1995.0,3.0,97498.5,0.92,0.713,3447.0
50%,2002.0,25.0,430150.0,5.99,0.779,9372.0
75%,2008.0,131.0,1486143.0,16.62,0.855,24874.0
max,2016.0,22338.0,43805210.0,224.97,0.944,126352.0


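__Grouping__ allows for aggregation over certain columns. For example, if we want to know the total number of suicides per country, we can sum all the entries for each country. Note that __sum()__ is applied to every numerical column, so the sums of columns such as __year__ or __HDI for year__ are not meaningful.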

In [39]:
dfGroup = dfS.groupby(['country']).sum(numeric_only=True)
dfGroup

country,year,suicides_no,population,suicides/100k pop,HDI for year,gdp_cap
Albania,527796,1970,62325467,924.76,32.304,490788
Antigua and Barbuda,647832,11,1990228,179.14,28.140,3385212
Argentina,744000,82219,1035985431,3894.59,93.552,2944044
Armenia,596832,1905,77348173,976.21,66.252,558428
Aruba,336720,101,1259677,1596.52,0.000,4069236
...,...,...,...,...,...,...
United Arab Emirates,144540,622,36502275,94.89,19.800,3035664
United Kingdom,744000,136805,1738767780,2790.92,103.620,11869908
United States,744000,1034013,8054027201,5140.97,106.992,14608296
Uruguay,672072,13138,84068943,6538.96,80.628,2561016


Other forms of aggregation and related functions are listed below. Note that abs and the cumulative functions return one value per row rather than a single value per group.
- count: number of non-null observations
- sum: sum of values
- mean: mean of values
- mad: mean absolute deviation (removed in pandas 2.0)
- median: median of values
- min: minimum
- max: maximum
- mode: mode
- abs: absolute value
- prod: product of values
- std: standard deviation
- var: variance
- sem: standard error of the mean
- quantile: sample quantile (value at a given fraction)
- cumsum: cumulative sum
- cumprod: cumulative product
- cummax: cumulative maximum
- cummin: cumulative minimum

If we want to retrieve parts of our DataFrame, we can use comparisons: we select the rows whose column values match our defined criteria.

In [40]:
#retrieve a DataFrame listing all rows which match the string 'Germany' in the column country
#or: give me all the data regarding Germany
dfG = dfS[dfS.country == 'Germany']
dfG.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_year,gdp_cap,generation
9710,Germany,1990,male,75+ years,1516,1717700,88.26,Germany1990,0.801,1764967948917,23546,G.I. Generation
9711,Germany,1990,male,55-74 years,2406,6593100,36.49,Germany1990,0.801,1764967948917,23546,G.I. Generation
9712,Germany,1990,male,35-54 years,3302,11127100,29.68,Germany1990,0.801,1764967948917,23546,Silent
9713,Germany,1990,female,75+ years,1174,3978800,29.51,Germany1990,0.801,1764967948917,23546,G.I. Generation
9714,Germany,1990,male,25-34 years,1488,6721200,22.14,Germany1990,0.801,1764967948917,23546,Boomers
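
Behind the scenes, the comparison produces a boolean Series (a mask), and indexing with it keeps the rows where the mask is True. The following sketch, on a hypothetical toy frame with made-up rows, shows how masks can be combined; all variable names are chosen for illustration:

```python
import pandas as pd

# Hypothetical rows, only meant to demonstrate the mechanics
toy = pd.DataFrame({
    'country': ['Germany', 'Germany', 'France', 'Italy'],
    'sex': ['male', 'female', 'male', 'female'],
    'year': [2000, 2000, 1999, 2001],
})

# A comparison yields a boolean Series; indexing with it keeps the True rows
mask = toy['country'] == 'Germany'
german_rows = toy[mask]

# Combine masks with & (and), | (or), ~ (not);
# each condition needs its own pair of parentheses
german_men = toy[(toy['country'] == 'Germany') & (toy['sex'] == 'male')]
not_german = toy[~mask]

# isin() tests membership in a list of values
de_or_fr = toy[toy['country'].isin(['Germany', 'France'])]

print(len(german_rows), len(german_men), len(not_german), len(de_or_fr))
```

The Python keywords `and`, `or` and `not` do not work element-wise on Series, which is why the operators `&`, `|` and `~` are used.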


In [41]:
dfG.describe()

Unnamed: 0,year,suicides_no,population,suicides/100k pop,HDI for year,gdp_cap
count,312.0,312.0,312.0,312.0,108.0,312.0
mean,2002.5,933.532051,6489986.0,15.559904,0.881778,35164.230769
std,7.512048,886.297355,3178586.0,17.260059,0.040699,8716.933587
min,1990.0,5.0,1482300.0,0.12,0.801,23546.0
25%,1996.0,168.25,4373980.0,3.5325,0.855,27888.0
50%,2002.5,759.0,4922470.0,10.595,0.906,32783.5
75%,2009.0,1249.5,9352304.0,20.8675,0.915,43614.0
max,2015.0,3427.0,13148810.0,88.26,0.916,50167.0


In [43]:
#TASK: retrieve all data of male persons in Germany from the year 2000
dfS.loc[(dfS.country == 'Germany') & (dfS.sex == 'male') & (dfS.year == 2000)]


Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_year,gdp_cap,generation
9830,Germany,2000,male,75+ years,1069,1738251,61.5,Germany2000,0.855,1949953934034,24922,G.I. Generation
9831,Germany,2000,male,55-74 years,2528,8760853,28.86,Germany2000,0.855,1949953934034,24922,Silent
9832,Germany,2000,male,35-54 years,2937,12201076,24.07,Germany2000,0.855,1949953934034,24922,Boomers
9834,Germany,2000,male,25-34 years,1002,6118772,16.38,Germany2000,0.855,1949953934034,24922,Generation X
9835,Germany,2000,male,15-24 years,575,4709057,12.21,Germany2000,0.855,1949953934034,24922,Generation X
9840,Germany,2000,male,5-14 years,25,4562649,0.55,Germany2000,0.855,1949953934034,24922,Millenials


In [44]:
#TASK: retrieve all data of male persons in Germany from the year 2000
t1 = dfG[(dfG.sex == 'male') & (dfG.year == 2000)]
t1

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_year,gdp_cap,generation
9830,Germany,2000,male,75+ years,1069,1738251,61.5,Germany2000,0.855,1949953934034,24922,G.I. Generation
9831,Germany,2000,male,55-74 years,2528,8760853,28.86,Germany2000,0.855,1949953934034,24922,Silent
9832,Germany,2000,male,35-54 years,2937,12201076,24.07,Germany2000,0.855,1949953934034,24922,Boomers
9834,Germany,2000,male,25-34 years,1002,6118772,16.38,Germany2000,0.855,1949953934034,24922,Generation X
9835,Germany,2000,male,15-24 years,575,4709057,12.21,Germany2000,0.855,1949953934034,24922,Generation X
9840,Germany,2000,male,5-14 years,25,4562649,0.55,Germany2000,0.855,1949953934034,24922,Millenials


In [45]:
t1.describe()

Unnamed: 0,year,suicides_no,population,suicides/100k pop,HDI for year,gdp_cap
count,6.0,6.0,6.0,6.0,6.0,6.0
mean,2000.0,1356.0,6348443.0,23.928333,0.855,24922.0
std,0.0,1136.868682,3667596.0,20.865353,1.216188e-16,0.0
min,2000.0,25.0,1738251.0,0.55,0.855,24922.0
25%,2000.0,681.75,4599251.0,13.2525,0.855,24922.0
50%,2000.0,1035.5,5413914.0,20.225,0.855,24922.0
75%,2000.0,2163.25,8100333.0,27.6625,0.855,24922.0
max,2000.0,2937.0,12201080.0,61.5,0.855,24922.0


Task: Retrieve all data after the year 2000 regarding [Generation X](https://en.wikipedia.org/wiki/Generation_X). Which countries are involved?

In [53]:
gen_x_older2000 = dfS.loc[(dfS.year > 2000) & (dfS.generation == 'Generation X')]
gen_x_older2000.country.unique()

array(['Albania', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
       'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain',
       'Barbados', 'Belarus', 'Belgium', 'Belize',
       'Bosnia and Herzegovina', 'Brazil', 'Bulgaria', 'Cabo Verde',
       'Canada', 'Chile', 'Colombia', 'Costa Rica', 'Croatia', 'Cuba',
       'Cyprus', 'Czech Republic', 'Denmark', 'Ecuador', 'El Salvador',
       'Estonia', 'Fiji', 'Finland', 'France', 'Georgia', 'Germany',
       'Greece', 'Grenada', 'Guatemala', 'Guyana', 'Hungary', 'Iceland',
       'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Kazakhstan',
       'Kiribati', 'Kuwait', 'Kyrgyzstan', 'Latvia', 'Lithuania',
       'Luxembourg', 'Maldives', 'Malta', 'Mauritius', 'Mexico',
       'Mongolia', 'Montenegro', 'Netherlands', 'New Zealand',
       'Nicaragua', 'Norway', 'Oman', 'Panama', 'Paraguay', 'Philippines',
       'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of Korea',
       'Romania', 'Russian Federation', 

In [54]:
t2 = dfS[(dfS.year > 2000) & (dfS.generation == 'Generation X')]
t2.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_year,gdp_cap,generation
144,Albania,2001,male,25-34 years,22,206484,10.65,Albania2001,,4060758804,1451,Generation X
154,Albania,2001,female,25-34 years,4,222771,1.8,Albania2001,,4060758804,1451,Generation X
157,Albania,2002,male,25-34 years,23,206286,11.15,Albania2002,,4435078648,1573,Generation X
163,Albania,2002,female,25-34 years,7,223685,3.13,Albania2002,,4435078648,1573,Generation X
174,Albania,2003,male,25-34 years,9,205433,4.38,Albania2003,,5746945913,2021,Generation X


In [55]:
t2.country.unique()

array(['Albania', 'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
       'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain',
       'Barbados', 'Belarus', 'Belgium', 'Belize',
       'Bosnia and Herzegovina', 'Brazil', 'Bulgaria', 'Cabo Verde',
       'Canada', 'Chile', 'Colombia', 'Costa Rica', 'Croatia', 'Cuba',
       'Cyprus', 'Czech Republic', 'Denmark', 'Ecuador', 'El Salvador',
       'Estonia', 'Fiji', 'Finland', 'France', 'Georgia', 'Germany',
       'Greece', 'Grenada', 'Guatemala', 'Guyana', 'Hungary', 'Iceland',
       'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Kazakhstan',
       'Kiribati', 'Kuwait', 'Kyrgyzstan', 'Latvia', 'Lithuania',
       'Luxembourg', 'Maldives', 'Malta', 'Mauritius', 'Mexico',
       'Mongolia', 'Montenegro', 'Netherlands', 'New Zealand',
       'Nicaragua', 'Norway', 'Oman', 'Panama', 'Paraguay', 'Philippines',
       'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Republic of Korea',
       'Romania', 'Russian Federation', 

In [56]:
#Task: retrieve all data from the DataFrame regarding Generation X in Germany between 2000 and 2010 (excluding both)

dfS.loc[(dfS.generation == 'Generation X') & (dfS.country == 'Germany') & (dfS.year > 2000) & (dfS.year < 2010)]






Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_year,gdp_cap,generation
9846,Germany,2001,male,25-34 years,985,5850255,16.84,Germany2001,,1950648769575,24874,Generation X
9850,Germany,2001,female,25-34 years,221,5552605,3.98,Germany2001,,1950648769575,24874,Generation X
9858,Germany,2002,male,25-34 years,903,5614445,16.08,Germany2002,,2079136081310,26441,Generation X
9862,Germany,2002,female,25-34 years,219,5348715,4.09,Germany2002,,2079136081310,26441,Generation X
9870,Germany,2003,male,25-34 years,868,5393875,16.09,Germany2003,,2505733634312,31816,Generation X
9874,Germany,2003,female,25-34 years,225,5161703,4.36,Germany2003,,2505733634312,31816,Generation X
9882,Germany,2004,male,25-34 years,820,5196875,15.78,Germany2004,,2819245095605,35772,Generation X
9886,Germany,2004,female,25-34 years,180,5000272,3.6,Germany2004,,2819245095605,35772,Generation X
9894,Germany,2005,male,25-34 years,730,5056621,14.44,Germany2005,0.887,2861410272354,36289,Generation X
9898,Germany,2005,female,25-34 years,171,4889216,3.5,Germany2005,0.887,2861410272354,36289,Generation X


In [57]:
#Task: retrieve all data from the DataFrame regarding Generation X in Germany between 2000 and 2010 (excluding both)
t3 = t2[(t2.country == 'Germany') & (t2.year > 2000) & (t2.year < 2010)]
t3

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_year,gdp_cap,generation
9846,Germany,2001,male,25-34 years,985,5850255,16.84,Germany2001,,1950648769575,24874,Generation X
9850,Germany,2001,female,25-34 years,221,5552605,3.98,Germany2001,,1950648769575,24874,Generation X
9858,Germany,2002,male,25-34 years,903,5614445,16.08,Germany2002,,2079136081310,26441,Generation X
9862,Germany,2002,female,25-34 years,219,5348715,4.09,Germany2002,,2079136081310,26441,Generation X
9870,Germany,2003,male,25-34 years,868,5393875,16.09,Germany2003,,2505733634312,31816,Generation X
9874,Germany,2003,female,25-34 years,225,5161703,4.36,Germany2003,,2505733634312,31816,Generation X
9882,Germany,2004,male,25-34 years,820,5196875,15.78,Germany2004,,2819245095605,35772,Generation X
9886,Germany,2004,female,25-34 years,180,5000272,3.6,Germany2004,,2819245095605,35772,Generation X
9894,Germany,2005,male,25-34 years,730,5056621,14.44,Germany2005,0.887,2861410272354,36289,Generation X
9898,Germany,2005,female,25-34 years,171,4889216,3.5,Germany2005,0.887,2861410272354,36289,Generation X
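
A range filter such as the one above can be written in several equivalent ways. The sketch below compares them on a hypothetical column of years (`toy`, `explicit`, `with_between` and `with_query` are illustrative names):

```python
import pandas as pd

# Hypothetical years, chosen to include both boundary values
toy = pd.DataFrame({'year': [1999, 2000, 2001, 2005, 2009, 2010, 2011]})

# Two comparisons, as in the cells above: both bounds are excluded
explicit = toy[(toy['year'] > 2000) & (toy['year'] < 2010)]

# between() with inclusive='neither' excludes both bounds;
# other options are 'both' (the default), 'left' and 'right'
with_between = toy[toy['year'].between(2000, 2010, inclusive='neither')]

# query() takes the same condition as a string expression
with_query = toy.query('year > 2000 and year < 2010')

print(with_between['year'].tolist())
```

All three selections contain the same rows; which form to use is mostly a matter of readability.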


In [66]:
#TASK (based on t2): List the suicide rates of the richest (per capita) 10% of all the countries between 2000 and 2010 (excluding both)
#make use of the .quantile() function

t2_richest_suic_rates = t2.loc[(t2.year > 2000) & (t2.year < 2010) & (t2.gdp_cap > t2.gdp_cap.quantile(0.9)), ['country', 'suicides/100k pop']]

print(t2_richest_suic_rates.head(10))
t2_richest_suic_rates.country.unique()


      country  suicides/100k pop
2066  Austria              15.85
2071  Austria               3.34
7565  Denmark              14.61
7570  Denmark               4.24
7577  Denmark              16.13
7582  Denmark               4.34
7589  Denmark              14.06
7594  Denmark               2.65
7602  Denmark              10.72
7606  Denmark               3.89


array(['Austria', 'Denmark', 'Finland', 'Iceland', 'Ireland',
       'Luxembourg', 'Netherlands', 'Norway', 'Qatar', 'San Marino',
       'Singapore', 'Sweden', 'Switzerland', 'United Kingdom'],
      dtype=object)

In [49]:
#TASK (based on t2): List the suicide rates of the richest (per capita) 10% of all the countries between 2000 and 2010 (excluding both)
#make use of the .quantile() function
top10 = t2[(t2.year > 2000) & (t2.year < 2010) & (t2.gdp_cap > t2.gdp_cap.quantile(0.9))]
top10.country.unique()

array(['Austria', 'Denmark', 'Finland', 'Iceland', 'Ireland',
       'Luxembourg', 'Netherlands', 'Norway', 'Qatar', 'San Marino',
       'Singapore', 'Sweden', 'Switzerland', 'United Kingdom'],
      dtype=object)

In [50]:
top10.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_year,gdp_cap,generation
2066,Austria,2008,male,25-34 years,86,542606,15.85,Austria2008,,430294287388,54294,Generation X
2071,Austria,2008,female,25-34 years,18,539635,3.34,Austria2008,,430294287388,54294,Generation X
7565,Denmark,2006,male,25-34 years,52,355908,14.61,Denmark2006,,282884912894,55354,Generation X
7570,Denmark,2006,female,25-34 years,15,353444,4.24,Denmark2006,,282884912894,55354,Generation X
7577,Denmark,2007,male,25-34 years,56,347246,16.13,Denmark2007,,319423370134,62229,Generation X


In [67]:
#TASK: solve the same question for the poorest 10%


t2_poor_suic_rates = t2.loc[(t2.year > 2000) & (t2.year < 2010) & (t2.gdp_cap < t2.gdp_cap.quantile(0.1)), ['country', 'suicides/100k pop']]

print(t2_poor_suic_rates.head(10))
t2_poor_suic_rates.country.unique()



      country  suicides/100k pop
144   Albania              10.65
154   Albania               1.80
157   Albania              11.15
163   Albania               3.13
174   Albania               4.38
175   Albania               4.04
182   Albania               7.85
186   Albania               4.95
1099  Armenia               0.70
1101  Armenia               0.36


array(['Albania', 'Armenia', 'Azerbaijan', 'Belarus', 'Bulgaria',
       'Colombia', 'Ecuador', 'El Salvador', 'Fiji', 'Georgia',
       'Guatemala', 'Guyana', 'Kazakhstan', 'Kiribati', 'Kyrgyzstan',
       'Montenegro', 'Paraguay', 'Philippines', 'Romania',
       'Russian Federation', 'Serbia', 'Sri Lanka', 'Suriname',
       'Thailand', 'Turkmenistan', 'Ukraine', 'Uzbekistan'], dtype=object)
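
Note that `t2.gdp_cap.quantile()` is computed over all rows of `t2`, so countries with many rows weigh more heavily in the threshold. One possible alternative, sketched here on hypothetical data and under the assumption that the mean GDP per capita is a fair summary of each country, is to compute the quantile over one value per country (all names and values below are made up for illustration):

```python
import pandas as pd

# Hypothetical data: ten countries, two rows each
toy = pd.DataFrame({
    'country': [c for c in 'ABCDEFGHIJ' for _ in range(2)],
    'gdp_cap': [g for g in range(1000, 11000, 1000) for _ in range(2)],
    'suicides/100k pop': [float(i) for i in range(20)],
})

# One GDP figure per country, so that every country counts once
gdp_by_country = toy.groupby('country')['gdp_cap'].mean()

# Countries strictly above the 90th percentile of the per-country values
threshold = gdp_by_country.quantile(0.9)
rich_countries = gdp_by_country[gdp_by_country > threshold].index

# Select the rows of these countries from the original frame
richest = toy.loc[toy['country'].isin(rich_countries), ['country', 'suicides/100k pop']]
print(threshold)
print(richest)
```

For the poorest 10%, the same pattern applies with `quantile(0.1)` and a `<` comparison.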

In [71]:
#TASK: In Germany - for GenX, are the suicide rates higher for men or women?

t2.loc[t2.country == 'Germany'].groupby(['sex']).mean(numeric_only=True).loc[:, 'suicides/100k pop']







sex
female     4.735625
male      16.611875
Name: suicides/100k pop, dtype: float64

In [51]:
gM = t3[t3.sex=='male']
gM.suicides_no.sum()

6858

In [52]:
gF = t3[t3.sex=='female']
gF.suicides_no.sum()

1618
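
The two filtered sums can also be obtained in one step with `groupby()`. This sketch assumes only the rows of `t3` that are printed above (2001 to 2005), so its totals differ from the 6858 and 1618 computed on the full `t3`; `t3_demo` and `totals` are names chosen for illustration:

```python
import pandas as pd

# The printed rows of t3, reduced to the two columns needed here
t3_demo = pd.DataFrame({
    'sex': ['male', 'female'] * 5,
    'suicides_no': [985, 221, 903, 219, 868, 225, 820, 180, 730, 171],
})

# One sum per value of 'sex', instead of one filter and one sum per group
totals = t3_demo.groupby('sex')['suicides_no'].sum()
print(totals)
```

This scales to any number of groups without writing a separate filter for each one.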

In [78]:
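#TASK: Which generation has the highest absolute number of suicides in Germany?
#note: max() is taken per column, so 'year' is the latest year of each generation, not the year of the peak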

dfG.groupby(['generation', 'year'], as_index = False)['suicides_no'].sum().groupby(['generation']).max()
#dfS.loc[dfS.country == 'Germany'].groupby(['generation']).max(numeric_only=True).loc[:,'suicides/100k pop']








generation,year,suicides_no
Boomers,2015,6561
G.I. Generation,2000,6513
Generation X,2015,4322
Generation Z,2015,28
Millenials,2015,1455
Silent,2015,5303


In [73]:
dfG.head()

Unnamed: 0,country,year,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_year,gdp_cap,generation
9710,Germany,1990,male,75+ years,1516,1717700,88.26,Germany1990,0.801,1764967948917,23546,G.I. Generation
9711,Germany,1990,male,55-74 years,2406,6593100,36.49,Germany1990,0.801,1764967948917,23546,G.I. Generation
9712,Germany,1990,male,35-54 years,3302,11127100,29.68,Germany1990,0.801,1764967948917,23546,Silent
9713,Germany,1990,female,75+ years,1174,3978800,29.51,Germany1990,0.801,1764967948917,23546,G.I. Generation
9714,Germany,1990,male,25-34 years,1488,6721200,22.14,Germany1990,0.801,1764967948917,23546,Boomers


In [74]:
df1 = dfG.groupby(['generation','year'], as_index=False)['suicides_no'].sum().groupby(['generation']).max()
#general pattern: df.groupby(['A', 'B'], as_index=False)['C'].sum()
df1

generation,year,suicides_no
Boomers,2015,6561
G.I. Generation,2000,6513
Generation X,2015,4322
Generation Z,2015,28
Millenials,2015,1455
Silent,2015,5303
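
In the table above, `max()` is applied to each column separately, so the year and the number of suicides in a row do not necessarily belong together. If we also want to know in which year each generation reached its peak, `idxmax()` can help. This is a sketch on hypothetical yearly totals (all values are made up; `toy`, `independent_max` and `peak_rows` are illustrative names):

```python
import pandas as pd

# Hypothetical yearly totals per generation
toy = pd.DataFrame({
    'generation': ['Boomers', 'Boomers', 'Boomers', 'Silent', 'Silent', 'Silent'],
    'year': [1990, 2000, 2015, 1990, 2000, 2015],
    'suicides_no': [4000, 6500, 5000, 5300, 4800, 3000],
})

# max() works per column: year and suicides_no may come from different rows
independent_max = toy.groupby('generation').max()

# idxmax() returns the row label of the largest suicides_no in each group;
# selecting these rows keeps the year that belongs to the peak
peak_rows = toy.loc[toy.groupby('generation')['suicides_no'].idxmax()]

print(independent_max)
print(peak_rows)
```

Comparing the two outputs shows the difference: the first reports the latest year for every generation, the second the year of the actual maximum.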
