# Marriage and Divorce Trends in Kazakhstan in 2000 - 2022
### 1. Aims, goals and background context

#### 1.1 Introduction
Marriage and divorce statistics serve as critical indicators of societal dynamics, reflecting cultural norms, economic conditions, and demographic shifts within a country. In Kazakhstan, as in many nations, understanding these trends provides essential insights into the fabric of society and its ongoing evolution. The goal of this paper is to explore the demographic and socio-economic factors influencing marriage and divorce trends in Kazakhstan, using public datasets from authoritative sources such as stat.gov.kz to uncover patterns and draw meaningful conclusions.

Specifically, I want to examine how marriage and divorce statistics differ between urban and rural areas of Kazakhstan, take into account the demographic situation in the corresponding years, and explore whether the number of marriages and divorces is affected by social and economic factors such as the unemployment rate, level of education and GDP per capita.

#### 1.2 Goals of the paper
Main goals of this paper:
1. Provide demographic context and illustrate marriage statistics across different regions of Kazakhstan (rural and urban) from 2000 to 2022.
2. Investigate differences in divorce rates between urban and rural areas considering corresponding population data.
3. Analyze correlations between marriage and divorce rates with factors such as unemployment rate, level of education and GDP per capita using relevant datasets.
4. Summarize key findings from the analysis of marriage and divorce trends, reflect on challenges encountered during the study, and present some predictive models of marriages and divorces in Kazakhstan.

#### 1.3 Background

This coursework was mostly prompted by a very interesting paper I came across on the economic side of marriage. Not only does it show that marriage makes people happier and richer, thereby potentially benefiting the nation economically, but it also points to studies suggesting that marriage reduces the risk of nursing home admission, potentially lowering nursing home costs [1].


### 2. Data preprocessing

#### 2.1 Data used

All the data used in this paper are datasets and time series from the official Bureau of National Statistics of Kazakhstan (stat.gov.kz). Some of the datasets were available in English, but most had to be translated from Kazakh or Russian into English. The data is publicly available for any use.

The datasets used in this paper specifically are:
1. Population in Kazakhstan 2000-2022
2. Marriages in Kazakhstan 2000-2022 [ENG] [KAZ]
3. Divorces in Kazakhstan 2000-2022 [ENG]
4. GDP per capita 2000-2022 [RUS]
5. Gross ratio of young people studying in educational institutes [ENG]

Also, as most of the data was published in the form of dashboards, some preprocessing was needed alongside the translation.


#### 2.2 Data cleansing

First, we take a look at the overall demographics of Kazakhstan. The dataset `data/population-kazakhstan-by-year.xlsx` is a two-dimensional table with years on the x-axis and regions on the y-axis. It also breaks the data down into urban and rural populations.
I want to remove the regional detail, leaving only three rows: urban population, rural population and overall. I also trim the years after 2022, as they are outside the scope of this research.


In [1]:
import pandas as pd

# Reading the dataset
file_path = 'data/population-kazakhstan-by-year.xlsx'

df = pd.read_excel(file_path, sheet_name='All population')
df.head(100)

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,The population of the Republic of Kazakhstan,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,"people, at the beginning of the year",,,,
4,,2000,2001,2002,2003,2004,2005,2006,2007,2008,...,2015,2016,2017,2018,2019,2020,2021,2022.0,2023.0,2024.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
64,Soltustik Kazakhstan,443363,436440,429655,426865,427674,438496,435379,430816,423691,...,328599,320659,313098,306847,301844,295929,290531,279968.0,275169.0,270749.0
65,Turkistan*,1218809,1222746,1232817,1257093,1283571,1312893,1341742,1373650,1441449,...,1540662,1563094,1576596,1588183,1594568,1610492,1627068,1579918.0,1599983.0,1614071.0
66,Ulytau,-,-,-,-,-,-,-,-,-,...,-,-,-,-,-,-,-,46424.0,46321.0,46157.0
67,Shygys Kazakhstan,627448,620309,613539,603902,598389,588731,580417,573978,652452,...,567896,560030,551263,537987,529375,519867,512325,250050.0,246835.0,243106.0


In [2]:
# creating a new dataframe to populate and export as .csv file
df_new = pd.DataFrame(columns=['Year', 'Rural', 'Urban', 'Overall'])

# indexes of the overall, urban and rural populations in the original dataset
overall_population_row = 6
urban_population_row = 28
rural_population_row = 50

for i in range (0, 23):
    overall_population = df.iloc[overall_population_row, i+1]
    urban_population = df.iloc[urban_population_row, i+1]
    rural_population = df.iloc[rural_population_row, i+1]
    
    # populate the new dataframe
    df_new = pd.concat([df_new, pd.DataFrame([{'Year':2000+i, 'Rural':rural_population, 'Urban':urban_population, 'Overall':overall_population}])], ignore_index=True)

# format the data in integer values
df_new['Year'] = df_new['Year'].astype(int)
df_new['Rural'] = df_new['Rural'].astype(int)
df_new['Urban'] = df_new['Urban'].astype(int) 
df_new['Overall'] = df_new['Overall'].astype(int)

df_new.head(23)

Unnamed: 0,Year,Rural,Urban,Overall
0,2000,6504075,8397566,14901641
1,2001,6452211,8413399,14865610
2,2002,6421728,8429331,14851059
3,2003,6409685,8457152,14866837
4,2004,6432958,8518242,14951200
5,2005,6460116,8614651,15074767
6,2006,6522771,8696520,15219291
7,2007,6563629,8833249,15396878
8,2008,7305571,8265935,15571506
9,2009,7319451,8662919,15982370


Now saving it as a new .csv file:

In [3]:
csv_file = 'data/processed/population.csv'
df_new.to_csv(csv_file, index=False)
print(f"dataframe saved to '{csv_file}'")

dataframe saved to 'data/processed/population.csv'


Here we manually identified where the important data is located, stored the row indexes in the corresponding variables and used them to iteratively extract the data. A more automated approach is possible: since all the needed data sits in rows whose first column is "Republic of Kazakhstan", each preceded by a row naming the type of population data, we could use conditionals to find and process the needed rows.
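
The automated approach mentioned above could look roughly like the sketch below. It runs on a small stand-in table, not on the real spreadsheet: the section labels (`Total population`, `Urban population`, `Rural population`), the column name `label` and the `Region A` rows are assumptions made for illustration, while the national figures for 2000 and 2001 are the ones extracted earlier.

```python
import pandas as pd

# Toy stand-in for the spreadsheet. Section labels, the "label" column name
# and the "Region A" placeholder rows are assumptions, not the real layout.
raw = pd.DataFrame(
    [
        ["Total population", None, None],
        ["Republic of Kazakhstan", 14901641, 14865610],
        ["Region A", 100, 101],  # placeholder regional row
        ["Urban population", None, None],
        ["Republic of Kazakhstan", 8397566, 8413399],
        ["Region A", 60, 61],    # placeholder regional row
        ["Rural population", None, None],
        ["Republic of Kazakhstan", 6504075, 6452211],
        ["Region A", 40, 40],    # placeholder regional row
    ],
    columns=["label", 2000, 2001],
)

labels = raw["label"]

# Rows holding the national totals
is_total = labels.eq("Republic of Kazakhstan")

# The row directly above each national row names the population type
section = labels.shift(1)[is_total].to_numpy()

totals = raw.loc[is_total].drop(columns="label")
totals.index = section

# Years become rows, population types become columns
tidy = totals.T.astype(int)
tidy.index = tidy.index.astype(int)
tidy = tidy.rename_axis("Year").reset_index()
tidy = tidy.rename(columns={
    "Total population": "Overall",
    "Urban population": "Urban",
    "Rural population": "Rural",
})

print(tidy)
```

The benefit over hard-coded row numbers is that the lookup would keep working if the statistics bureau adds or removes a region, as long as the labels themselves stay the same.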

Now let's move on to the pair of datasets on marriages and divorces. They are presented in Kazakh, so we will need to translate and process them.
In the end, our data will be transformed like this:

![data_transform](other/datatransform.png)

For that we will extract the necessary data, remove unused data, group the data by year, transpose the rows and columns of the initial dataset, tidy it up and sort it by year.

In [4]:
translations = {
    'marriages': 'Тіркелген некелер саны(Адам)',
    'divorces': 'Тіркелген ажырасулар саны(Адам)',
    'Rural': 'ауылдық жер',
    'Urban': 'қалалық жер',
    'Overall': 'Барлығы',
    'Region': 'ӘАОЖ(каталог бойынша)',
    'Type': 'ОЖА'
}

dfm = pd.read_csv('data/marriages-kaz.csv')
dfd = pd.read_csv('data/divorces-kaz.csv')

# renaming the region column so we can parse it out easily
dfm = dfm.rename(columns={translations['Region']: 'Region'})
dfd = dfd.rename(columns={translations['Region']: 'Region'})    

# filter out the regions, leaving only country-level statistics
# (.copy() avoids pandas' SettingWithCopyWarning on the assignments below)
dfm = dfm[dfm['Region'] == 'ҚАЗАҚСТАН РЕСПУБЛИКАСЫ'].copy()
dfd = dfd[dfd['Region'] == 'ҚАЗАҚСТАН РЕСПУБЛИКАСЫ'].copy()

# removing the " жыл" ("year") suffix from the date
dfm['DAT'] = dfm['DAT'].str.replace(' жыл', '')
dfd['DAT'] = dfd['DAT'].str.replace(' жыл', '')

Here we've stumbled upon a poorly formatted CSV file. In some places `\t` is used as the delimiter and in others three spaces. Because the delimiters were inconsistent, changing the delimiter in `pd.read_csv()` did not help, and the data could not be parsed as-is. I replaced all the tabs and runs of spaces between values with commas.

![alternative text](other/tabs_in_df.png)
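
The replacement described above can also be scripted, so it can be repeated if the source file is downloaded again. The sketch below is only an illustration on an in-memory string: the `Type` header and the exact layout of the lines are assumptions, and single spaces inside values (such as `38 867`) are deliberately left untouched.

```python
import io
import re

import pandas as pd

# Toy text mimicking the mixed delimiters: tabs in some places,
# three spaces in others. The header names and layout are assumed.
raw_text = (
    "DAT\tType   VAL\n"
    "2000 жыл\tауылдық жер   38 867\n"
    "2000 жыл   қалалық жер\t52 006\n"
)

# Replace every run of tabs, or every run of three or more spaces, with a
# comma. Single spaces inside a value are not matched and so are preserved.
cleaned = re.sub(r"\t+| {3,}", ",", raw_text)

df_clean = pd.read_csv(io.StringIO(cleaned))
print(df_clean)
```

For a real file, the same `re.sub` call could be applied to the text read from disk before handing it to `pd.read_csv()`.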

In [5]:
# creating new dataframes as the old ones are very verbose and not normalized
# (columns are given as a list: a set has no defined order and newer pandas versions refuse it)
dfm_new = pd.DataFrame(columns=['Year', 'Rural', 'Urban', 'Overall'])
dfd_new = pd.DataFrame(columns=['Year', 'Rural', 'Urban', 'Overall'])


# a function to be used on both marriages and divorces dataframes
def update_dataframe(df_old, df_new):
    for i, row in df_old.iterrows():
        indexes = df_new.loc[df_new['Year'] == row['DAT']].index
        if row[translations['Type']] == translations['Rural']:
            if indexes.size == 0:
                df_new = pd.concat([df_new, pd.DataFrame([{'Year': row['DAT'], 'Rural': row['VAL'], 'Urban': 0, 'Overall': 0}])], ignore_index=True)
            else:
                df_new.loc[indexes[0], 'Rural'] = row['VAL']
        elif row[translations['Type']] == translations['Urban']:
            if indexes.size == 0:
                df_new = pd.concat([df_new, pd.DataFrame([{'Year': row['DAT'], 'Rural': 0, 'Urban': row['VAL'], 'Overall': 0}])], ignore_index=True)
            else:
                df_new.loc[indexes[0], 'Urban'] = row['VAL']
        elif row[translations['Type']] == translations['Overall']:
            if indexes.size == 0:
                df_new = pd.concat([df_new, pd.DataFrame([{'Year': row['DAT'], 'Rural': 0, 'Urban': 0, 'Overall': row['VAL']}])], ignore_index=True)
            else:
                df_new.loc[indexes[0], 'Overall'] = row['VAL']
    
    return df_new

dfm_new = update_dataframe(dfm, dfm_new)
dfd_new = update_dataframe(dfd, dfd_new)

# sort the results by year
dfm_new = dfm_new[['Year', 'Rural', 'Urban', 'Overall']].sort_values(by='Year')
dfd_new = dfd_new[['Year', 'Rural', 'Urban', 'Overall']].sort_values(by='Year')

# cleanup spaces in count columns
columns_to_clean = ['Rural', 'Urban', 'Overall']
for col in columns_to_clean:
    dfm_new[col] = dfm_new[col].str.replace(' ', '')
    dfd_new[col] = dfd_new[col].str.replace(' ', '')

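# remove the year 2023 from the dataframes (filter by value rather than by row label,
# since the labels depend on the order in which the rows were inserted)
dfm_new = dfm_new[dfm_new['Year'].astype(str) != '2023'].reset_index(drop=True)
dfd_new = dfd_new[dfd_new['Year'].astype(str) != '2023'].reset_index(drop=True)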

dfm_new.head(100)


Unnamed: 0,Year,Rural,Urban,Overall
0,2000,38867,52006,90873
1,2001,38457,54395,92852
2,2002,40457,58529,98986
3,2003,43620,66794,110414
4,2004,45172,69513,114685
5,2005,45971,77074,123045
6,2006,51582,85622,137204
7,2007,60611,85768,146379
8,2008,53045,82527,135572
9,2009,55260,85525,140785
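
As an aside, the long-to-wide reshaping that `update_dataframe` performs row by row can also be expressed with `DataFrame.pivot`. The sketch below is an alternative on toy data, not the code used in this notebook: the `Type` column and its English labels are assumptions (the real file uses the Kazakh labels from `translations`), and the values are the 2000 and 2001 marriage counts shown above.

```python
import pandas as pd

# Toy long-format data: one row per (year, area type).
# English "Type" labels are assumed here for readability.
long_df = pd.DataFrame({
    "DAT": ["2001", "2000", "2000", "2001", "2000", "2001"],
    "Type": ["Rural", "Rural", "Urban", "Urban", "Overall", "Overall"],
    "VAL": ["38 457", "38 867", "52 006", "54 395", "90 873", "92 852"],
})

# Strip the thousands spaces and convert the counts to integers
long_df["VAL"] = long_df["VAL"].str.replace(" ", "").astype(int)

# One row per year, one column per area type
wide = long_df.pivot(index="DAT", columns="Type", values="VAL")
wide.columns.name = None
wide = wide.rename_axis("Year").reset_index()

# Fix the column order and sort by year
wide = wide[["Year", "Rural", "Urban", "Overall"]]
wide = wide.sort_values("Year").reset_index(drop=True)

print(wide)
```

Note that `pivot` raises an error if a (year, type) pair occurs more than once, which can double as a cheap check for duplicated rows in the source file.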


In [6]:
csv_file_dfm = 'data/processed/marriages.csv'
csv_file_dfd = 'data/processed/divorces.csv'
dfm_new.to_csv(csv_file_dfm, index=False)
dfd_new.to_csv(csv_file_dfd, index=False)
print(f"dataframes saved to '{csv_file_dfm}', '{csv_file_dfd}'")

dataframes saved to 'data/processed/marriages.csv', 'data/processed/divorces.csv'


Marriage and divorce rates exhibit significant correlations with GDP per capita, reflecting complex interactions between economic conditions and societal behaviors. Regions with higher GDP per capita typically experience elevated marriage rates, attributed to greater economic security and stability (Stevenson & Wolfers, 2007) [2]. This phenomenon underscores how economic prosperity encourages individuals to commit to long-term partnerships, benefiting from financial assurances and opportunities.

Now let's move on to cleansing the GDP per capita dataset. The plan is to rename the columns and remove the data not covered in this research: both rows for years outside the study period and columns such as 'External debt'.

In [7]:
df_gdp = pd.read_csv('data/kazakhstan-gdp-rus.csv')

# rename the columns to neat English variants
df_gdp = df_gdp.rename(columns={'Год': 'Year', 'ВВП (ППС)(в млрд долл. США)': 'GDP in billions USD', 'ВВП на душу населения (ППС)(в долл. США)': 'GDP per capita in USD'})    

# keep only the years 2000-2022
df_gdp = df_gdp[df_gdp['Year'] > 1999]
df_gdp = df_gdp[df_gdp['Year'] < 2023]

# replace commas with dots
df_gdp['GDP in billions USD'] = df_gdp['GDP in billions USD'].str.replace(',', '.')

# delete commas in GDP per capita (here identifying thousands)
df_gdp['GDP per capita in USD'] = df_gdp['GDP per capita in USD'].str.replace(',', '')

# format as float type
df_gdp['GDP in billions USD'] = df_gdp['GDP in billions USD'].astype(float)
df_gdp['GDP per capita in USD'] = df_gdp['GDP per capita in USD'].astype(float)

# now take only the columns important for our research
df_gdp = df_gdp[['Year', 'GDP in billions USD', 'GDP per capita in USD']]
df_gdp.reset_index(drop=True, inplace=True)

Preview the dataset and save it as csv:

In [8]:
df_gdp.head()

Unnamed: 0,Year,GDP in billions USD,GDP per capita in USD
0,2000,117.3,7890.0
1,2001,136.2,9168.0
2,2002,151.8,10211.0
3,2003,169.2,11318.0
4,2004,190.6,12642.0


In [9]:
csv_file = 'data/processed/gdp.csv'
df_gdp.to_csv(csv_file, index=False)
print(f"dataframe saved to '{csv_file}'")

dataframe saved to 'data/processed/gdp.csv'
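
With the population, marriage and GDP data cleaned, a first look at the relationship between marriages and GDP per capita could be sketched as follows. This is only an illustration under stated assumptions: it uses the first five years (2000 to 2004) from the previews above instead of reading the processed files, it divides by the beginning-of-year population as a simplification of the crude marriage rate, and the column name `marriages_per_1000` is introduced here for the example. Five data points are far too few to draw conclusions from.

```python
import pandas as pd

years = [2000, 2001, 2002, 2003, 2004]

# Values copied from the previews of the processed dataframes above
population = pd.DataFrame({
    "Year": years,
    "Overall": [14901641, 14865610, 14851059, 14866837, 14951200],
})
marriages = pd.DataFrame({
    "Year": years,
    "Overall": [90873, 92852, 98986, 110414, 114685],
})
gdp = pd.DataFrame({
    "Year": years,
    "GDP per capita in USD": [7890.0, 9168.0, 10211.0, 11318.0, 12642.0],
})

# Join the three tables on the year
merged = (
    population
    .merge(marriages, on="Year", suffixes=("_pop", "_marriages"))
    .merge(gdp, on="Year")
)

# Marriages per 1000 inhabitants (beginning-of-year population, a simplification)
merged["marriages_per_1000"] = (
    merged["Overall_marriages"] / merged["Overall_pop"] * 1000
)

# Pearson correlation between the marriage rate and GDP per capita
r = merged["marriages_per_1000"].corr(merged["GDP per capita in USD"])
print(merged[["Year", "marriages_per_1000", "GDP per capita in USD"]])
print(f"Pearson r = {r:.3f}")
```

Working with a rate rather than the raw number of marriages matters here because the population itself changes over the period, so raw counts would mix demographic growth with changes in behaviour.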
