### Content
This dataset was compiled from four other datasets linked by time and place. It was built to find signals correlated with increased suicide rates among different cohorts globally, across the socio-economic spectrum.

### References
United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506

World Bank. (2018). World development indicators: GDP (current US$) by country:1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#

[Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook

World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/

### Inspiration
Suicide Prevention.

The goal of this lab is to gain hands-on experience with feature engineering on a well-known suicide statistics dataset.

You will need to prepare the data for training a linear model that predicts the number of suicides per 100k population (column suicides/100k pop).

Checklist:
0. Study the file annotation.txt. It contains information about the dataset.
1. Load the dataset data.csv.
2. Look at the data. Display general information about the features (recall describe and info). Write your observations in markdown.
3. Identify the missing values and the possible reasons for them. Decide what to do with them. Write your observations in markdown.
4. Assess the dependencies between the variables. Use correlations. Using profile_report would be a plus. Write your observations in markdown.
5. Define a strategy for transforming the categorical features (i.e. how to make them suitable for models).
6. Find features that can be split into others or converted to another data type. Remove redundant ones if necessary.
7. Split the sample into training and test sets.
8. Train a linear model. Write your observations on the results in markdown.

If you run into difficulties, refer to the material from the practical sessions. It should be enough to complete every item. Good luck!

In [329]:
import pandas as pd
import numpy as np

import pandas_profiling

import seaborn as sns
from matplotlib import pyplot as plt

**1. Load the dataset data.csv.**

In [136]:
data = pd.read_csv('data.csv')
data.head()

Unnamed: 0,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


In [9]:
#list(data)

**0. Study the file annotation.txt. It contains information about the dataset.**

- sex - sex
- age - age group
- suicides_no - number of suicides
- population - population size
- suicides/100k pop - number of suicides per 100k inhabitants
- country-year - country and year
- HDI for year - Human Development Index: a composite index measuring average achievement in three basic dimensions of human development (a long and healthy life, knowledge, and a decent standard of living)
- gdp_for_year - GDP for the year
- gdp_per_capita - GDP per capita
- generation - generation

**2. Look at the data. Display general information about the features (recall describe and info). Write your observations in markdown.**

In [5]:
data.describe()

Unnamed: 0,suicides_no,population,suicides/100k pop,HDI for year,gdp_per_capita ($)
count,27820.0,27820.0,27820.0,8364.0,27820.0
mean,242.574407,1844794.0,12.816097,0.776601,16866.464414
std,902.047917,3911779.0,18.961511,0.093367,18887.576472
min,0.0,278.0,0.0,0.483,251.0
25%,3.0,97498.5,0.92,0.713,3447.0
50%,25.0,430150.0,5.99,0.779,9372.0
75%,131.0,1486143.0,16.62,0.855,24874.0
max,22338.0,43805210.0,224.97,0.944,126352.0


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27820 entries, 0 to 27819
Data columns (total 10 columns):
sex                   27820 non-null object
age                   27820 non-null object
suicides_no           27820 non-null int64
population            27820 non-null int64
suicides/100k pop     27820 non-null float64
country-year          27820 non-null object
HDI for year          8364 non-null float64
 gdp_for_year ($)     27820 non-null object
gdp_per_capita ($)    27820 non-null int64
generation            27820 non-null object
dtypes: float64(2), int64(3), object(5)
memory usage: 2.1+ MB


**Conclusion:** overall the data looks reasonable. Only one column has missing values, but about 70% of its values are missing. There are outliers in the number of suicides: the gap between the 75th percentile and the maximum is very large. The column 'gdp_for_year ($)' holds numeric values, but in the dataset its dtype is object.
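
The two observations above (share of missing values, non-numeric columns) can be checked directly. A minimal sketch on a toy frame; the frame `toy` and its values are illustrative assumptions, not the lab data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "suicides_no": [21, 16, 14, 1],
    "HDI for year": [np.nan, 0.7, np.nan, np.nan],
    "gdp_for_year": ["2,156,624,900", "2,156,624,900", "1,000", "5"],
})

# Fraction of missing values in every column
missing_share = toy.isna().mean()

# Columns that are not numeric and may need conversion
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
```

On the real frame the same two expressions would be applied to `data` instead of `toy`.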

**3. Identify the missing values and the possible reasons for them. Decide what to do with them. Write your observations in markdown.**

Missing values are present in only one column. A likely reason is that the indicator was not calculated for a given country in a given year. On the one hand, 'HDI for year' is an interesting feature that would be nice to keep; on the other hand, a very large share of its values is missing. Filling the gaps with the median or the mode is feasible, since the range of this feature is fairly small. But if 70% of the values are filled with the same number, the feature will be of little use.  
Let's try two options (if time permits):
- drop the 'HDI for year' column entirely
- drop all rows that have no 'HDI for year' value
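
The two options listed above can be sketched as follows. The frame `toy` is a hypothetical stand-in with made-up rows; only the column names come from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names are taken from the dataset
toy = pd.DataFrame({
    "HDI for year": [np.nan, 0.713, np.nan, 0.779],
    "gdp_per_capita ($)": [796, 3447, 9372, 24874],
})

# Option 1: drop the sparse column and keep every row
without_column = toy.drop(columns=["HDI for year"])

# Option 2: keep the column and drop the rows where it is missing
without_rows = toy.dropna(subset=["HDI for year"])
```

Option 1 keeps the sample size, while option 2 keeps the feature at the cost of most of the rows.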

**4. Assess the dependencies between the variables. Use correlations. Using profile_report would be a plus. Write your observations in markdown.**

In [7]:
pandas_profiling.ProfileReport(data)



0,1
Number of variables,10
Number of observations,27820
Total Missing (%),7.0%
Total size in memory,2.1 MiB
Average record size in memory,80.0 B

0,1
Numeric,5
Categorical,5
Boolean,0
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
male,13910
female,13910

Value,Count,Frequency (%),Unnamed: 3
male,13910,50.0%,
female,13910,50.0%,

0,1
Distinct count,6
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
35-54 years,4642
75+ years,4642
25-34 years,4642
Other values (3),13894

Value,Count,Frequency (%),Unnamed: 3
35-54 years,4642,16.7%,
75+ years,4642,16.7%,
25-34 years,4642,16.7%,
15-24 years,4642,16.7%,
55-74 years,4642,16.7%,
5-14 years,4610,16.6%,

0,1
Distinct count,2084
Unique (%),7.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,242.57
Minimum,0
Maximum,22338
Zeros (%),15.4%

0,1
Minimum,0
5-th percentile,0
Q1,3
Median,25
Q3,131
95-th percentile,1050
Maximum,22338
Range,22338
Interquartile range,128

0,1
Standard deviation,902.05
Coef of variation,3.7186
Kurtosis,157.17
Mean,242.57
MAD,335.99
Skewness,10.353
Sum,6748420
Variance,813690
Memory size,217.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0,4281,15.4%,
1,1539,5.5%,
2,1102,4.0%,
3,867,3.1%,
4,696,2.5%,
5,538,1.9%,
6,467,1.7%,
7,429,1.5%,
8,365,1.3%,
9,349,1.3%,

Value,Count,Frequency (%),Unnamed: 3
0,4281,15.4%,
1,1539,5.5%,
2,1102,4.0%,
3,867,3.1%,
4,696,2.5%,

Value,Count,Frequency (%),Unnamed: 3
20705,1,0.0%,
21063,1,0.0%,
21262,1,0.0%,
21706,1,0.0%,
22338,1,0.0%,

0,1
Distinct count,25564
Unique (%),91.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1844800
Minimum,278
Maximum,43805214
Zeros (%),0.0%

0,1
Minimum,278.0
5-th percentile,7195.6
Q1,97498.0
Median,430150.0
Q3,1486100.0
95-th percentile,8850200.0
Maximum,43805214.0
Range,43804936.0
Interquartile range,1388600.0

0,1
Standard deviation,3911800
Coef of variation,2.1204
Kurtosis,27.407
Mean,1844800
MAD,2221000
Skewness,4.4594
Sum,51322158436
Variance,15302000000000
Memory size,217.5 KiB

Value,Count,Frequency (%),Unnamed: 3
24000,20,0.1%,
26900,13,0.0%,
22000,12,0.0%,
20700,12,0.0%,
4900,11,0.0%,
21700,10,0.0%,
1000,10,0.0%,
20500,10,0.0%,
9000,10,0.0%,
21000,9,0.0%,

Value,Count,Frequency (%),Unnamed: 3
278,2,0.0%,
286,1,0.0%,
287,1,0.0%,
290,1,0.0%,
291,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
43139910,1,0.0%,
43240905,1,0.0%,
43509335,1,0.0%,
43607902,1,0.0%,
43805214,1,0.0%,

0,1
Distinct count,5298
Unique (%),19.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,12.816
Minimum,0
Maximum,224.97
Zeros (%),15.4%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.92
Median,5.99
Q3,16.62
95-th percentile,50.53
Maximum,224.97
Range,224.97
Interquartile range,15.7

0,1
Standard deviation,18.962
Coef of variation,1.4795
Kurtosis,12.166
Mean,12.816
MAD,12.575
Skewness,2.9634
Sum,356540
Variance,359.54
Memory size,217.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,4281,15.4%,
0.29,72,0.3%,
0.32,69,0.2%,
0.34,55,0.2%,
0.37,52,0.2%,
0.33,49,0.2%,
0.3,48,0.2%,
0.41,47,0.2%,
0.22,46,0.2%,
0.31,46,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.0,4281,15.4%,
0.02,5,0.0%,
0.03,8,0.0%,
0.04,14,0.1%,
0.05,10,0.0%,

Value,Count,Frequency (%),Unnamed: 3
182.32,1,0.0%,
185.37,1,0.0%,
187.06,1,0.0%,
204.92,1,0.0%,
224.97,1,0.0%,

0,1
Distinct count,2321
Unique (%),8.3%
Missing (%),0.0%
Missing (n),0

0,1
Lithuania2008,12
Ireland1997,12
Uzbekistan2009,12
Other values (2318),27784

Value,Count,Frequency (%),Unnamed: 3
Lithuania2008,12,0.0%,
Ireland1997,12,0.0%,
Uzbekistan2009,12,0.0%,
Ecuador1998,12,0.0%,
Guatemala1997,12,0.0%,
Argentina1997,12,0.0%,
Luxembourg1999,12,0.0%,
Puerto Rico1986,12,0.0%,
Russian Federation2006,12,0.0%,
Australia1997,12,0.0%,

0,1
Distinct count,306
Unique (%),1.1%
Missing (%),69.9%
Missing (n),19456
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.7766
Minimum,0.483
Maximum,0.944
Zeros (%),0.0%

0,1
Minimum,0.483
5-th percentile,0.619
Q1,0.713
Median,0.779
Q3,0.855
95-th percentile,0.912
Maximum,0.944
Range,0.461
Interquartile range,0.142

0,1
Standard deviation,0.093367
Coef of variation,0.12022
Kurtosis,-0.64791
Mean,0.7766
MAD,0.077889
Skewness,-0.30088
Sum,6495.5
Variance,0.0087173
Memory size,217.5 KiB

Value,Count,Frequency (%),Unnamed: 3
0.772,84,0.3%,
0.888,84,0.3%,
0.713,84,0.3%,
0.7609999999999999,72,0.3%,
0.909,72,0.3%,
0.83,72,0.3%,
0.8270000000000001,72,0.3%,
0.7929999999999999,72,0.3%,
0.7559999999999999,72,0.3%,
0.867,60,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.483,12,0.0%,
0.513,12,0.0%,
0.522,12,0.0%,
0.539,12,0.0%,
0.542,12,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.935,12,0.0%,
0.94,12,0.0%,
0.941,12,0.0%,
0.942,24,0.1%,
0.944,12,0.0%,

0,1
Distinct count,2321
Unique (%),8.3%
Missing (%),0.0%
Missing (n),0

0,1
181977476217,12
441975282335,12
251373036671,12
Other values (2318),27784

Value,Count,Frequency (%),Unnamed: 3
181977476217,12,0.0%,
441975282335,12,0.0%,
251373036671,12,0.0%,
25562251656,12,0.0%,
1179659529660,12,0.0%,
430040370,12,0.0%,
2331005587,12,0.0%,
37440673478,12,0.0%,
17903681693,12,0.0%,
5252629000000,12,0.0%,

0,1
Distinct count,2233
Unique (%),8.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,16866
Minimum,251
Maximum,126352
Zeros (%),0.0%

0,1
Minimum,251
5-th percentile,935
Q1,3447
Median,9372
Q3,24874
95-th percentile,54294
Maximum,126352
Range,126101
Interquartile range,21427

0,1
Standard deviation,18888
Coef of variation,1.1198
Kurtosis,4.9378
Mean,16866
MAD,14185
Skewness,1.9635
Sum,469225040
Variance,356740000
Memory size,217.5 KiB

Value,Count,Frequency (%),Unnamed: 3
1299,36,0.1%,
2303,36,0.1%,
4104,36,0.1%,
996,24,0.1%,
30850,24,0.1%,
1077,24,0.1%,
24654,24,0.1%,
2916,24,0.1%,
36289,24,0.1%,
5590,24,0.1%,

Value,Count,Frequency (%),Unnamed: 3
251,12,0.0%,
291,12,0.0%,
313,12,0.0%,
345,12,0.0%,
357,12,0.0%,

Value,Count,Frequency (%),Unnamed: 3
113120,12,0.0%,
120423,12,0.0%,
121315,12,0.0%,
122729,12,0.0%,
126352,12,0.0%,

0,1
Distinct count,6
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Generation X,6408
Silent,6364
Millenials,5844
Other values (3),9204

Value,Count,Frequency (%),Unnamed: 3
Generation X,6408,23.0%,
Silent,6364,22.9%,
Millenials,5844,21.0%,
Boomers,4990,17.9%,
G.I. Generation,2744,9.9%,
Generation Z,1470,5.3%,

Unnamed: 0,sex,age,suicides_no,population,suicides/100k pop,country-year,HDI for year,gdp_for_year ($),gdp_per_capita ($),generation
0,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


**Conclusion:** 
- the columns 'suicides_no' (a) and 'population' (b) are the components of 'suicides/100k pop' (c): c = a / b * 100k. So if the target is 'suicides/100k pop', predicting it from 'suicides_no' and 'population' is not interesting.
- 'age' and 'generation' each have 6 unique values, but they are not interchangeable, because the groups differ in size.  
- curiously, the dataset has an equal number of rows for each value of 'sex'.  
- the outliers in 'suicides_no' were already noted above.  
- 'HDI for year' correlates fairly strongly with 'gdp_per_capita ($)', so we choose to drop the 'HDI for year' column: it has many missing values, and another feature is strongly correlated with it.
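
The relation c = a / b * 100k from the first point can be verified on the five rows shown by `data.head()`. A small sketch; rounding to two decimals is an assumption based on how the rate appears in the table:

```python
import numpy as np
import pandas as pd

# First five rows of the dataset, copied from the data.head() output
rows = pd.DataFrame({
    "suicides_no": [21, 16, 14, 1, 9],
    "population": [312900, 308000, 289700, 21800, 274300],
    "suicides/100k pop": [6.71, 5.19, 4.83, 4.59, 3.28],
})

# c = a / b * 100k, rounded to two decimals (assumed precision of the stored rate)
recomputed = (rows["suicides_no"] / rows["population"] * 100_000).round(2)

# True when every recomputed rate equals the stored one
matches = bool(np.allclose(recomputed, rows["suicides/100k pop"]))
```

Because the target is fully determined by these two columns, keeping both among the features would leak the answer into the model.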


**5. Define a strategy for transforming the categorical features (i.e. how to make them suitable for models).**

We will encode sex, age and generation with one-hot encoding (OHE). We will use either age or generation.
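
As a compact illustration of one-hot encoding, here is a sketch that uses `pandas.get_dummies` rather than the scikit-learn `OneHotEncoder` used later in this notebook. The frame `toy` is a hypothetical example:

```python
import pandas as pd

# Hypothetical rows with two categorical columns
toy = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "age": ["15-24 years", "75+ years", "5-14 years"],
})

# One indicator column per category; each name is prefixed with the source column
encoded = pd.get_dummies(toy, columns=["sex", "age"], dtype=float)
```

The prefix in the generated names keeps it clear which source column each indicator came from.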

**6. Find features that can be split into others or converted to another data type. Remove redundant ones if necessary.**

In [143]:
# Remove spaces from the column names
data.rename(columns={'suicides/100k pop': 'suicides/100k_pop',
                     'HDI for year': 'HDI_for_year',
                     ' gdp_for_year ($) ': 'gdp_for_year_$',
                     ' gdp per capita ($) ': 'gdp_for_capita_$'}, inplace=True)

In [146]:
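# The last column was not renamed by the previous cell: the key ' gdp per capita ($) ' used there does not match the actual name 'gdp_per_capita ($)', and rename() silently skips unknown keys. Repeat with the correct name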
data.rename(columns={'gdp_per_capita ($)': 'gdp_for_capita_$'}, inplace=True)
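
A sketch of how such a mismatch can be caught early. The one-row frame `toy` is a hypothetical stand-in; `errors="raise"` is a parameter of `DataFrame.rename`:

```python
import pandas as pd

# Hypothetical one-row frame with the two GDP column names as they appear in the data
toy = pd.DataFrame({"gdp_per_capita ($)": [796], " gdp_for_year ($) ": ["2,156,624,900"]})

# A key that matches no column is skipped without any warning
unchanged = toy.rename(columns={" gdp per capita ($) ": "gdp_for_capita_$"})

# errors="raise" turns the same mismatch into a KeyError
try:
    toy.rename(columns={" gdp per capita ($) ": "gdp_for_capita_$"}, errors="raise")
    mismatch_detected = False
except KeyError:
    mismatch_detected = True

# Stripping whitespace from every name first makes the keys predictable
stripped = toy.rename(columns=str.strip)
```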

In [147]:
data.head()

Unnamed: 0,sex,age,suicides_no,population,suicides/100k_pop,country-year,HDI_for_year,gdp_for_year_$,gdp_for_capita_$,generation
0,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X
1,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent
2,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X
3,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation
4,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers


In [151]:
# Convert the 'gdp_for_year ($)' column to numeric
data['gdp_for_year_$'] = data['gdp_for_year_$'].str.strip().str.replace(',', '').astype('int64')

In [154]:
# Split country and year
data['country'] = data['country-year'].str[:-4]
data['year'] = data['country-year'].str[-4:].astype(int)
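
The slicing above relies on the year always being the last four characters. An alternative sketch with a regular expression that makes this assumption explicit; the sample values are taken from the profile report output:

```python
import pandas as pd

# Sample values taken from the profile report above
country_year = pd.Series(["Albania1987", "Russian Federation2006", "Puerto Rico1986"])

# Everything before the final four digits is the country, the digits are the year
parts = country_year.str.extract(r"^(?P<country>.*?)(?P<year>\d{4})$")
parts["year"] = parts["year"].astype(int)
```

Rows that do not end in four digits would come back as missing values instead of being split at the wrong position.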

In [160]:
# This does not work; kept only as a record of the attempts to strip the commas
#for i in len(data.iloc[:, [7]]):
#    data.iloc[i, [7]] = int(str(data.iloc[i, [7]].values).replace(',', '').strip("['']"))

In [234]:
# Exclude outliers in 'suicides_no': drop values above the 95th percentile, 1050 (taken from the profile report)
df_1 = data[data['suicides_no'] <= 1050]
df_2 = df_1.reset_index().drop(['index'], axis=1) # reset the index so it stays contiguous after filtering
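
The cut-off can also be computed from the data instead of being hard-coded. A minimal sketch on a toy column; the values are illustrative, loosely based on the quartiles shown earlier:

```python
import pandas as pd

# Illustrative values only
toy = pd.DataFrame({"suicides_no": [0, 1, 3, 25, 131, 22338]})

# 95th percentile of the column
threshold = toy["suicides_no"].quantile(0.95)

# Keep rows at or below the cut-off and renumber the index from zero
trimmed = toy[toy["suicides_no"] <= threshold].reset_index(drop=True)
```

`reset_index(drop=True)` has the same effect as `reset_index()` followed by dropping the `index` column.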

In [236]:
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26429 entries, 0 to 26428
Data columns (total 12 columns):
sex                  26429 non-null object
age                  26429 non-null object
suicides_no          26429 non-null int64
population           26429 non-null int64
suicides/100k_pop    26429 non-null float64
country-year         26429 non-null object
HDI_for_year         8004 non-null float64
gdp_for_year_$       26429 non-null int64
gdp_for_capita_$     26429 non-null int64
generation           26429 non-null object
country              26429 non-null object
year                 26429 non-null int32
dtypes: float64(2), int32(1), int64(4), object(5)
memory usage: 2.3+ MB


In [315]:
# One-hot encode the 'sex' column
from sklearn import preprocessing

In [238]:
sex = df_2['sex'].to_numpy().reshape(-1, 1) # OneHotEncoder expects a 2-D array of shape (n_samples, n_features)

In [247]:
oh_encoder = preprocessing.OneHotEncoder()
oh_encoder.fit(sex)
oh_result1 = oh_encoder.transform(sex).toarray()
#oh_result1

In [243]:
sex_columns_names = df_2['sex'].unique()
sex_columns = ['sex_{}'.format(i) for i in reversed(sex_columns_names)] # reversed so that the names match the encoder's column order (female, male)
#sex_df.index = df_2.index
sex_df = pd.DataFrame(oh_result1, columns=sex_columns)
df_sexOHE = pd.concat([df_2, sex_df], axis=1)
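
Why the names need reordering: `Series.unique()` returns labels in order of first appearance, whereas scikit-learn's `OneHotEncoder` arranges its output columns by sorted category (exposed as `categories_`). The sketch below reproduces this with NumPy only, so it is an emulation of the encoder's ordering rather than a call to it. The toy series is an assumption:

```python
import numpy as np
import pandas as pd

# Toy labels in the same order of first appearance as in the data ("male" first)
sex = pd.Series(["male", "male", "female", "male"])

# Series.unique() keeps the order of first appearance
appearance_order = list(sex.unique())

# np.unique() returns the labels sorted, the order the encoder uses for its columns
labels = sex.to_numpy(dtype=str)
sorted_order = list(np.unique(labels))

# reshape(-1, 1) turns n labels into an (n, 1) column: n samples, one feature
column = labels.reshape(-1, 1)

# Hand-built one-hot matrix whose columns follow the sorted order
one_hot = (column == np.array(sorted_order)).astype(float)
labelled = pd.DataFrame(one_hot, columns=["sex_{}".format(name) for name in sorted_order])
```

Taking the names from the sorted categories removes the need for `reversed` or manual swaps.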

In [246]:
#df_sexOHE.head()

In [376]:
# One-hot encode the 'age' column
age = df_2['age'].to_numpy().reshape(-1, 1) # OneHotEncoder expects a 2-D array of shape (n_samples, n_features)

In [377]:
oh_encoder = preprocessing.OneHotEncoder()
oh_encoder.fit(age)
oh_result2 = oh_encoder.transform(age).toarray()
#oh_result2

In [378]:
a = df_sexOHE['age'].unique()

In [379]:
a[1], a[2], a[3], a[4], a[5]  = a[3], a[1], a[5], a[4], a[2] # reorder so that the names match the encoded columns (sorted categories)

In [380]:
age_columns = ['{}'.format(i) for i in a]
age_df = pd.DataFrame(oh_result2, columns=a)
df_age_sex_OHE = pd.concat([df_sexOHE, age_df], axis=1)

In [381]:
df_age_sex_OHE.head()

Unnamed: 0,sex,age,suicides_no,population,suicides/100k_pop,country-year,HDI_for_year,gdp_for_year_$,gdp_for_capita_$,generation,country,year,sex_female,sex_male,15-24 years,25-34 years,35-54 years,5-14 years,55-74 years,75+ years
0,male,15-24 years,21,312900,6.71,Albania1987,,2156624900,796,Generation X,Albania,1987,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
1,male,35-54 years,16,308000,5.19,Albania1987,,2156624900,796,Silent,Albania,1987,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,female,15-24 years,14,289700,4.83,Albania1987,,2156624900,796,Generation X,Albania,1987,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,male,75+ years,1,21800,4.59,Albania1987,,2156624900,796,G.I. Generation,Albania,1987,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
4,male,25-34 years,9,274300,3.28,Albania1987,,2156624900,796,Boomers,Albania,1987,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0


In [382]:
df_age_sex_OHE.drop(['HDI_for_year', 'sex', 'age', 'country-year'], axis=1, inplace=True)

In [383]:
# Select the columns for the first model

In [413]:
X = df_age_sex_OHE.drop(['generation', 'suicides/100k_pop', 'country', 'gdp_for_year_$', 'population'], axis=1)
y = df_age_sex_OHE['suicides/100k_pop']

In [414]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

In [415]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

estimator = LinearRegression()
estimator.fit(X_train, y_train) # training


y_pred = estimator.predict(X_test)

print("R2: \t", r2_score(y_test, y_pred))
print("RMSE: \t", np.sqrt(mean_squared_error(y_test, y_pred)))
print("MAE: \t", mean_absolute_error(y_test, y_pred))

R2: 	 0.3229841155779729
RMSE: 	 14.757881311670177
MAE: 	 9.299343171874549
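
For reference, the three metrics printed above can be written out with NumPy. This is a sketch of the standard definitions; `y_hat` holds hypothetical predictions, not the model's output:

```python
import numpy as np

y_true = np.array([6.71, 5.19, 4.83, 4.59, 3.28])  # rates from the head() output
y_hat = np.array([6.0, 5.0, 5.0, 4.0, 4.0])        # hypothetical predictions

residuals = y_true - y_hat

# Mean absolute error
mae = np.mean(np.abs(residuals))

# Root mean squared error
rmse = np.sqrt(np.mean(residuals ** 2))

# R2 = 1 - SS_res / SS_tot
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
```

RMSE penalises large errors more than MAE does, so a wide gap between the two points to a few badly predicted rows.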


In [387]:
# One-hot encode the 'generation' column
generation = df_age_sex_OHE['generation'].to_numpy().reshape(-1, 1) # OneHotEncoder expects a 2-D array of shape (n_samples, n_features)

In [388]:
oh_encoder = preprocessing.OneHotEncoder()
oh_encoder.fit(generation)
oh_result3 = oh_encoder.transform(generation).toarray()
#oh_result3

In [389]:
# Take the labels from the encoder: its output columns follow the sorted categories, not the order of appearance returned by unique()
g = oh_encoder.categories_[0]

In [390]:
gen_columns = ['{}'.format(i) for i in g]
gen_df = pd.DataFrame(oh_result3, columns=g)
df_age_sex_gen_OHE = pd.concat([df_age_sex_OHE, gen_df], axis=1)

In [391]:
df_age_sex_gen_OHE.head()

Unnamed: 0,suicides_no,population,suicides/100k_pop,gdp_for_year_$,gdp_for_capita_$,generation,country,year,sex_female,sex_male,...,35-54 years,5-14 years,55-74 years,75+ years,Boomers,G.I. Generation,Generation X,Generation Z,Millenials,Silent
0,21,312900,6.71,2156624900,796,Generation X,Albania,1987,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,16,308000,5.19,2156624900,796,Silent,Albania,1987,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,14,289700,4.83,2156624900,796,Generation X,Albania,1987,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,1,21800,4.59,2156624900,796,G.I. Generation,Albania,1987,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
4,9,274300,3.28,2156624900,796,Boomers,Albania,1987,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [422]:
X = df_age_sex_gen_OHE.drop(['generation', 'suicides/100k_pop', 'country', 'year', 'population'], axis=1)
y = df_age_sex_gen_OHE['suicides/100k_pop']

In [423]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

In [424]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

estimator = LinearRegression()
estimator.fit(X_train, y_train) # training


y_pred = estimator.predict(X_test)

print("R2: \t", r2_score(y_test, y_pred))
print("RMSE: \t", np.sqrt(mean_squared_error(y_test, y_pred)))
print("MAE: \t", mean_absolute_error(y_test, y_pred))

R2: 	 0.3352204191677225
RMSE: 	 14.623907107897486
MAE: 	 9.133980735491225


In [425]:
pandas_profiling.ProfileReport(df_age_sex_gen_OHE)



0,1
Number of variables,22
Number of observations,26429
Total Missing (%),0.0%
Total size in memory,4.3 MiB
Average record size in memory,172.0 B

0,1
Numeric,6
Categorical,2
Boolean,14
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

0,1
Distinct count,996
Unique (%),3.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,98.918
Minimum,0
Maximum,1050
Zeros (%),16.2%

0,1
Minimum,0
5-th percentile,0
Q1,2
Median,20
Q3,102
95-th percentile,511
Maximum,1050
Range,1050
Interquartile range,100

0,1
Standard deviation,180.64
Coef of variation,1.8262
Kurtosis,8.1548
Mean,98.918
MAD,117.6
Skewness,2.7867
Sum,2614298
Variance,32631
Memory size,206.6 KiB

Value,Count,Frequency (%),Unnamed: 3
0,4281,16.2%,
1,1539,5.8%,
2,1102,4.2%,
3,867,3.3%,
4,696,2.6%,
5,538,2.0%,
6,467,1.8%,
7,429,1.6%,
8,365,1.4%,
9,349,1.3%,

Value,Count,Frequency (%),Unnamed: 3
0,4281,16.2%,
1,1539,5.8%,
2,1102,4.2%,
3,867,3.3%,
4,696,2.6%,

Value,Count,Frequency (%),Unnamed: 3
1043,2,0.0%,
1045,3,0.0%,
1048,2,0.0%,
1049,1,0.0%,
1050,4,0.0%,

0,1
Distinct count,24189
Unique (%),91.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1335200
Minimum,278
Maximum,28461855
Zeros (%),0.0%

0,1
Minimum,278.0
5-th percentile,6973.4
Q1,88218.0
Median,384900.0
Q3,1240300.0
95-th percentile,5873000.0
Maximum,28461855.0
Range,28461577.0
Interquartile range,1152100.0

0,1
Standard deviation,2601800
Coef of variation,1.9486
Kurtosis,21.149
Mean,1335200
MAD,1526200
Skewness,4.0424
Sum,35288322916
Variance,6769200000000
Memory size,206.6 KiB

Value,Count,Frequency (%),Unnamed: 3
24000,20,0.1%,
26900,13,0.0%,
20700,12,0.0%,
22000,12,0.0%,
4900,11,0.0%,
9000,10,0.0%,
1000,10,0.0%,
20500,10,0.0%,
21700,10,0.0%,
28600,9,0.0%,

Value,Count,Frequency (%),Unnamed: 3
278,2,0.0%,
286,1,0.0%,
287,1,0.0%,
290,1,0.0%,
291,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
26488792,1,0.0%,
26979566,1,0.0%,
27475364,1,0.0%,
27971096,1,0.0%,
28461855,1,0.0%,

0,1
Distinct count,4859
Unique (%),18.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,11.638
Minimum,0
Maximum,224.97
Zeros (%),16.2%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.77
Median,5.29
Q3,14.93
95-th percentile,44.316
Maximum,224.97
Range,224.97
Interquartile range,14.16

0,1
Standard deviation,17.777
Coef of variation,1.5275
Kurtosis,15.027
Mean,11.638
MAD,11.514
Skewness,3.2495
Sum,307580
Variance,316.01
Memory size,206.6 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,4281,16.2%,
0.29,72,0.3%,
0.32,69,0.3%,
0.34,55,0.2%,
0.37,52,0.2%,
0.33,49,0.2%,
0.3,48,0.2%,
0.41,47,0.2%,
0.31,46,0.2%,
0.22,46,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.0,4281,16.2%,
0.02,5,0.0%,
0.03,8,0.0%,
0.04,14,0.1%,
0.05,10,0.0%,

Value,Count,Frequency (%),Unnamed: 3
177.57,1,0.0%,
177.61,1,0.0%,
187.06,1,0.0%,
204.92,1,0.0%,
224.97,1,0.0%,

0,1
Distinct count,2321
Unique (%),8.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,296880000000
Minimum,46919625
Maximum,18120714000000
Zeros (%),0.0%

0,1
Minimum,46919625
5-th percentile,709450000
Q1,8154300000
Median,41509000000
Q3,218540000000
95-th percentile,1365900000000
Maximum,18120714000000
Range,18120667080375
Interquartile range,210380000000

0,1
Standard deviation,970690000000
Coef of variation,3.2697
Kurtosis,128.98
Mean,296880000000
MAD,384250000000
Skewness,9.7832
Sum,7846155841655576
Variance,9.4224e+23
Memory size,206.6 KiB

Value,Count,Frequency (%),Unnamed: 3
1368625150,12,0.0%,
49209523810,12,0.0%,
1793754805,12,0.0%,
99886577331,12,0.0%,
53274304222,12,0.0%,
60004630234,12,0.0%,
167775274725,12,0.0%,
78039572222,12,0.0%,
176192886551,12,0.0%,
5438537482,12,0.0%,

Value,Count,Frequency (%),Unnamed: 3
46919625,12,0.0%,
47515189,12,0.0%,
47737955,12,0.0%,
54832578,12,0.0%,
56338028,12,0.0%,

Value,Count,Frequency (%),Unnamed: 3
15517926000000,4,0.0%,
16155255000000,4,0.0%,
16691517000000,4,0.0%,
17427609000000,4,0.0%,
18120714000000,3,0.0%,

0,1
Distinct count,2233
Unique (%),8.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,16531
Minimum,251
Maximum,126352
Zeros (%),0.0%

0,1
Minimum,251
5-th percentile,925
Q1,3406
Median,8937
Q3,23984
95-th percentile,54510
Maximum,126352
Range,126101
Interquartile range,20578

0,1
Standard deviation,18933
Coef of variation,1.1453
Kurtosis,5.3181
Mean,16531
MAD,14025
Skewness,2.0525
Sum,436907486
Variance,358460000
Memory size,206.6 KiB

Value,Count,Frequency (%),Unnamed: 3
2303,36,0.1%,
1299,35,0.1%,
4104,33,0.1%,
1122,24,0.1%,
1552,24,0.1%,
939,24,0.1%,
3719,24,0.1%,
1845,24,0.1%,
3960,24,0.1%,
1664,24,0.1%,

Value,Count,Frequency (%),Unnamed: 3
251,12,0.0%,
291,12,0.0%,
313,12,0.0%,
345,12,0.0%,
357,12,0.0%,

Value,Count,Frequency (%),Unnamed: 3
113120,12,0.0%,
120423,12,0.0%,
121315,12,0.0%,
122729,12,0.0%,
126352,12,0.0%,

0,1
Distinct count,6
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Generation X,6133
Silent,5976
Millenials,5737
Other values (3),8583

Value,Count,Frequency (%),Unnamed: 3
Generation X,6133,23.2%,
Silent,5976,22.6%,
Millenials,5737,21.7%,
Boomers,4519,17.1%,
G.I. Generation,2594,9.8%,
Generation Z,1470,5.6%,

0,1
Distinct count,101
Unique (%),0.4%
Missing (%),0.0%
Missing (n),0

0,1
Austria,382
Mauritius,382
Iceland,382
Other values (98),25283

Value,Count,Frequency (%),Unnamed: 3
Austria,382,1.4%,
Mauritius,382,1.4%,
Iceland,382,1.4%,
Netherlands,382,1.4%,
Malta,372,1.4%,
Argentina,372,1.4%,
Luxembourg,372,1.4%,
Israel,372,1.4%,
Belgium,372,1.4%,
Ecuador,372,1.4%,

0,1
Distinct count,32
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2001.3
Minimum,1985
Maximum,2016
Zeros (%),0.0%

0,1
Minimum,1985
5-th percentile,1987
Q1,1995
Median,2002
Q3,2008
95-th percentile,2014
Maximum,2016
Range,31
Interquartile range,13

0,1
Standard deviation,8.469
Coef of variation,0.0042318
Kurtosis,-1.0488
Mean,2001.3
MAD,7.2227
Skewness,-0.16326
Sum,52891302
Variance,71.724
Memory size,206.6 KiB

Value,Count,Frequency (%),Unnamed: 3
2009,1017,3.8%,
2001,1008,3.8%,
2010,1004,3.8%,
2007,984,3.7%,
2000,984,3.7%,
2003,983,3.7%,
2002,982,3.7%,
2011,979,3.7%,
2006,973,3.7%,
2008,972,3.7%,

Value,Count,Frequency (%),Unnamed: 3
1985,551,2.1%,
1986,551,2.1%,
1987,620,2.3%,
1988,561,2.1%,
1989,591,2.2%,

Value,Count,Frequency (%),Unnamed: 3
2012,920,3.5%,
2013,911,3.4%,
2014,885,3.3%,
2015,698,2.6%,
2016,159,0.6%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.51383

0,1
1.0,13580
0.0,12849

Value,Count,Frequency (%),Unnamed: 3
1.0,13580,51.4%,
0.0,12849,48.6%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.48617

0,1
0.0,13580
1.0,12849

Value,Count,Frequency (%),Unnamed: 3
0.0,13580,51.4%,
1.0,12849,48.6%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.17129

0,1
0.0,21902
1.0,4527

Value,Count,Frequency (%),Unnamed: 3
0.0,21902,82.9%,
1.0,4527,17.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.16743

0,1
0.0,22004
1.0,4425

Value,Count,Frequency (%),Unnamed: 3
0.0,22004,83.3%,
1.0,4425,16.7%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.15612

0,1
0.0,22303
1.0,4126

Value,Count,Frequency (%),Unnamed: 3
0.0,22303,84.4%,
1.0,4126,15.6%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.17443

0,1
0.0,21819
1.0,4610

Value,Count,Frequency (%),Unnamed: 3
0.0,21819,82.6%,
1.0,4610,17.4%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.16293

0,1
0.0,22123
1.0,4306

Value,Count,Frequency (%),Unnamed: 3
0.0,22123,83.7%,
1.0,4306,16.3%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.16781

0,1
0.0,21994
1.0,4435

Value,Count,Frequency (%),Unnamed: 3
0.0,21994,83.2%,
1.0,4435,16.8%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.17099

0,1
0.0,21910
1.0,4519

Value,Count,Frequency (%),Unnamed: 3
0.0,21910,82.9%,
1.0,4519,17.1%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.09815

0,1
0.0,23835
1.0,2594

Value,Count,Frequency (%),Unnamed: 3
0.0,23835,90.2%,
1.0,2594,9.8%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.23206

0,1
0.0,20296
1.0,6133

Value,Count,Frequency (%),Unnamed: 3
0.0,20296,76.8%,
1.0,6133,23.2%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.055621

0,1
0.0,24959
1.0,1470

Value,Count,Frequency (%),Unnamed: 3
0.0,24959,94.4%,
1.0,1470,5.6%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.21707

0,1
0.0,20692
1.0,5737

Value,Count,Frequency (%),Unnamed: 3
0.0,20692,78.3%,
1.0,5737,21.7%,

0,1
Distinct count,2
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.22612

0,1
0.0,20453
1.0,5976

Value,Count,Frequency (%),Unnamed: 3
0.0,20453,77.4%,
1.0,5976,22.6%,

Unnamed: 0,suicides_no,population,suicides/100k_pop,gdp_for_year_$,gdp_for_capita_$,generation,country,year,sex_female,sex_male,15-24 years,25-34 years,35-54 years,5-14 years,55-74 years,75+ years,Boomers,G.I. Generation,Generation X,Generation Z,Millenials,Silent
0,21,312900,6.71,2156624900,796,Generation X,Albania,1987,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,16,308000,5.19,2156624900,796,Silent,Albania,1987,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,14,289700,4.83,2156624900,796,Generation X,Albania,1987,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,1,21800,4.59,2156624900,796,G.I. Generation,Albania,1987,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
4,9,274300,3.28,2156624900,796,Boomers,Albania,1987,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [426]:
# This invites the "curse of dimensionality", of course, but let's one-hot encode the 'country' column as well
country = df_age_sex_gen_OHE['country'].to_numpy().reshape(-1, 1) # OneHotEncoder expects a 2-D array of shape (n_samples, n_features)

In [427]:
oh_encoder = preprocessing.OneHotEncoder()
oh_encoder.fit(country)
oh_result4 = oh_encoder.transform(country).toarray()
#oh_result4

In [428]:
coun = df_age_sex_gen_OHE['country'].unique()

In [429]:
coun_columns = ['{}'.format(i) for i in coun]
coun_df = pd.DataFrame(oh_result4, columns=coun)
df_age_sex_gen_coun_OHE = pd.concat([df_age_sex_gen_OHE, coun_df], axis=1)

In [430]:
df_age_sex_gen_coun_OHE.head()

Unnamed: 0,suicides_no,population,suicides/100k_pop,gdp_for_year_$,gdp_for_capita_$,generation,country,year,sex_female,sex_male,...,Thailand,Trinidad and Tobago,Turkey,Turkmenistan,Ukraine,United Arab Emirates,United Kingdom,United States,Uruguay,Uzbekistan
0,21,312900,6.71,2156624900,796,Generation X,Albania,1987,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,16,308000,5.19,2156624900,796,Silent,Albania,1987,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,14,289700,4.83,2156624900,796,Generation X,Albania,1987,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,21800,4.59,2156624900,796,G.I. Generation,Albania,1987,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,9,274300,3.28,2156624900,796,Boomers,Albania,1987,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [450]:
X = df_age_sex_gen_coun_OHE.drop(['generation', 'suicides/100k_pop', 'country', 'year', 'suicides_no', 'population'], axis=1)
y = df_age_sex_gen_coun_OHE['suicides/100k_pop']

In [451]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

In [452]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

estimator = LinearRegression()
estimator.fit(X_train, y_train) # training


y_pred = estimator.predict(X_test)

print("R2: \t", r2_score(y_test, y_pred))
print("RMSE: \t", np.sqrt(mean_squared_error(y_test, y_pred)))
print("MAE: \t", mean_absolute_error(y_test, y_pred))

R2: 	 0.4990121073681648
RMSE: 	 12.695160599624229
MAE: 	 8.270679447975112


**Conclusion:** the suicide rate per 100k population depends strongly on the country of residence, since including the country improved the quality of the model. Note that this model was trained without 'suicides_no' and 'population'. If they are included, R2 rises only to 0.53.