## Data

This is a historical dataset of Ukrainian deputies from 1990 to 2018. The data were scraped from Wikipedia.
The file `ukraine_deputies.csv` contains 3851 rows and 26 columns. Each row corresponds to an individual deputy in a given Parliament.

Each column is described at the point where it is used.

The dataset was created in October 2018.

**Caveat:** The data are in Ukrainian.

## Imports
I am using a typical data science stack: `numpy`, `pandas`, `seaborn`, `matplotlib`.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # for plotting
import matplotlib.pyplot as plt # for plotting
import datetime

import os
print(os.listdir("../input"))

Let's take a quick look at what the data looks like:

In [None]:
data = pd.read_csv('../input/ukraine_deputies.csv')
data.head()

In [None]:
data.columns

In [None]:
data.shape

In [None]:
data.info()

Two columns already describe a deputy's education: `Alma mater` and `Освіта` ("Education"). Let's concatenate them into a single new column.

In [None]:
# Concatenate the two education columns; rows where both are missing become NaN
data['Education'] = (data['Alma mater'].fillna('') + data['Освіта'].fillna('')).replace('', np.nan)
data = data.drop(['Alma mater', 'Освіта'], axis=1)
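
One side effect of plain string concatenation is that rows with values in both columns get them glued together without a separator. The following is a minimal sketch of an alternative, not the approach used in this kernel; the toy values, the helper `join_non_missing` and the `'; '` separator are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the two education columns (values are made up)
toy = pd.DataFrame({
    'Alma mater': ['University A', np.nan, 'University C', np.nan],
    'Освіта': [np.nan, 'higher', 'higher', np.nan],
})

def join_non_missing(row, sep='; '):
    """Join the non-missing values of a row, or return NaN if all are missing."""
    parts = [str(v) for v in row if pd.notna(v)]
    return sep.join(parts) if parts else np.nan

education = toy[['Alma mater', 'Освіта']].apply(join_non_missing, axis=1)
print(education.tolist())
```

Rows with a single source keep that value unchanged, and rows with no source stay missing.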

The date fields (`Start Work` and `End Work`) are not in a consistent format. There are at least four different variants: `1992-03-12`, `23.12.2014`, `13.08.2015 †` and `2.12.2014[34]`. All unique values of both columns are shown in the hidden outputs below.

The column `End Work` has missing values for the last (current) Parliament. I substitute the missing dates with the current date.

Let's look at those values.

In [None]:
data['Start Work'].unique()

In [None]:
data['End Work'].unique()

In [None]:
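today = datetime.datetime.now().strftime('%d.%m.%Y')
# Deputies of the current Parliament have no end date, so use today's date
data['End Work'] = data['End Work'].fillna(today)
# Strip footnote markers such as [34] and trailing symbols such as " †"
data['End Work'] = data['End Work'].str.replace(r'\[\d*\]| \S*', '', regex=True)
today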
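
To make the cleaning step easier to follow, here is a small self-contained sketch that applies it to the four example values mentioned above. The two-pass pattern is an assumption chosen for readability (drop footnote markers first, then any character that is not a digit, dot or hyphen); it is not the exact expression used in the kernel.

```python
import pandas as pd

# The four formats observed in the date columns
raw = pd.Series(['23.12.2014', '13.08.2015 †', '2.12.2014[34]', '1992-03-12'])

cleaned = (
    raw.str.replace(r'\[\d+\]', '', regex=True)    # drop footnote markers such as [34]
       .str.replace(r'[^\d.\-]', '', regex=True)   # drop anything that is not a digit, dot or hyphen
)
print(cleaned.tolist())
```

Passing `regex=True` explicitly matters because recent pandas versions treat the pattern as a literal string by default.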

In [None]:
# Rada 8 dates use DD.MM.YYYY: reverse the parts to get YYYY-MM-DD
data['WorkEnd'] = data[data['Rada']==8].apply(lambda x: '-'.join(x['End Work'].split('.')[::-1]), axis=1)
# Earlier Radas are used as they are
data['WorkEnd'] = data['WorkEnd'].fillna(data[data['Rada']<8]['End Work'])
data['WorkEnd'] = pd.to_datetime(data['WorkEnd']).dt.date
data = data.drop(['End Work'], axis=1)

In [None]:
# Same conversion for the start dates
data['WorkStart'] = data[data['Rada']==8].apply(lambda x: '-'.join(x['Start Work'].split('.')[::-1]), axis=1)
data['WorkStart'] = data['WorkStart'].fillna(data[data['Rada']<8]['Start Work'])
data['WorkStart'] = pd.to_datetime(data['WorkStart']).dt.date
data = data.drop(['Start Work'], axis=1)
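
As an alternative to reversing the date parts by hand, the two formats can be parsed separately and then combined. This is only a sketch on made-up values, assuming the strings have already been cleaned of footnotes and symbols; it does not depend on the `Rada` column.

```python
import pandas as pd

# Toy values in the two cleaned formats, plus one missing entry
dates = pd.Series(['1992-03-12', '23.12.2014', '2.12.2014', None])

# Parse each format on its own; values that do not match become NaT
iso = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')
dotted = pd.to_datetime(dates, format='%d.%m.%Y', errors='coerce')

# Take the ISO result where available, otherwise the DD.MM.YYYY result
parsed = iso.fillna(dotted)
print(parsed)
```

Giving an explicit `format` avoids any ambiguity between day-first and month-first interpretations.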

I also created a new column, `WorkPeriod`, that holds the length of each deputy's term in days.

In [None]:
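# The columns hold plain date objects, so convert back to datetime64 before using .dt
data['WorkPeriod'] = (pd.to_datetime(data['WorkEnd']) - pd.to_datetime(data['WorkStart'])).dt.days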
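
The same calculation can be sketched on a toy frame. The dates and the fixed `cutoff` below are assumptions for illustration; the kernel itself uses the date on which it is run.

```python
import pandas as pd

# Toy terms: one finished, one still running (missing end date)
toy = pd.DataFrame({
    'WorkStart': pd.to_datetime(['2014-11-27', '2014-12-02']),
    'WorkEnd': pd.to_datetime(['2015-08-13', None]),
})

# Hypothetical fixed cutoff so that the result is reproducible
cutoff = pd.Timestamp('2018-10-31')
toy['WorkEnd'] = toy['WorkEnd'].fillna(cutoff)

# Subtracting datetime64 columns gives timedeltas; .dt.days extracts whole days
toy['WorkPeriod'] = (toy['WorkEnd'] - toy['WorkStart']).dt.days
print(toy)
```

A fixed cutoff keeps the durations stable between runs, whereas the current date makes them grow every time the notebook is executed.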

Let's run `data.info()` again to compare the columns before and after processing. The data look cleaner now, but there is still work to do.

In [None]:
data.info()

## Conclusion
This concludes my starter analysis for now! To go further from here, click the blue "Edit Notebook" button at the top of the kernel. This will create a copy of the code and environment for you to edit.