<a href="https://colab.research.google.com/github/silvererudite/simulationAndModeling/blob/main/Gender_violence.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [290]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams["figure.figsize"] = [10, 5]
import plotly.express as px
import math

In [291]:
violence_data= pd.read_csv('/content/drive/MyDrive/gender_violence_dataset/violence_data.csv')
violence_data.head()

Unnamed: 0,RecordID,Country,Gender,Demographics Question,Demographics Response,Question,Survey Year,Value
0,1,Afghanistan,F,Marital status,Never married,... if she burns the food,01/01/2015,
1,1,Afghanistan,F,Education,Higher,... if she burns the food,01/01/2015,10.1
2,1,Afghanistan,F,Education,Secondary,... if she burns the food,01/01/2015,13.7
3,1,Afghanistan,F,Education,Primary,... if she burns the food,01/01/2015,13.8
4,1,Afghanistan,F,Marital status,"Widowed, divorced, separated",... if she burns the food,01/01/2015,13.8


In [292]:
violence_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12600 entries, 0 to 12599
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   RecordID               12600 non-null  int64  
 1   Country                12600 non-null  object 
 2   Gender                 12600 non-null  object 
 3   Demographics Question  12600 non-null  object 
 4   Demographics Response  12600 non-null  object 
 5   Question               12600 non-null  object 
 6   Survey Year            12600 non-null  object 
 7   Value                  11187 non-null  float64
dtypes: float64(1), int64(1), object(6)
memory usage: 787.6+ KB


In [293]:
violence_data.shape

(12600, 8)

In [294]:
violence_data['Country'].value_counts()

Kenya                        180
Madagascar                   180
Rwanda                       180
Togo                         180
Benin                        180
                            ... 
Congo Democratic Republic    180
Lesotho                      180
Liberia                      180
South Africa                 180
Ghana                        180
Name: Country, Length: 70, dtype: int64

In [295]:
violence_data['Demographics Response'].value_counts()

25-34                           840
Secondary                       840
Married or living together      840
Widowed, divorced, separated    840
Never married                   840
15-24                           840
Urban                           840
Rural                           840
35-49                           840
No education                    840
Employed for cash               840
Primary                         840
Employed for kind               840
Higher                          840
Unemployed                      840
Name: Demographics Response, dtype: int64

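Drop the `Survey Year` column, remove every row whose `Value` is NaN (those rows carry no information for this analysis), and rename the two demographics columns so they contain no spaces and can be used with attribute access and `query`.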

In [296]:
new_df = violence_data.drop(['Survey Year'], axis=1)
new_df = new_df.dropna(subset=['Value'])
new_df = new_df.rename(columns={
    "Demographics Question": "Demographics_Question",
    "Demographics Response": "Demographics_Response",
})

In [297]:
new_df.shape

(11187, 7)
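As a quick sanity check on the cleaning step, the number of rows removed should equal the number of missing `Value` entries (12600 - 11187 = 1413 according to the outputs above). The following is a minimal sketch on an invented three-row frame, not the real dataset; the name `toy` and its contents are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for violence_data; the rows are invented for illustration.
toy = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Survey Year": ["01/01/2015"] * 3,
    "Value": [np.nan, 10.1, 13.7],
})

n_missing = int(toy["Value"].isna().sum())
cleaned = toy.drop(columns=["Survey Year"]).dropna(subset=["Value"])

# Rows removed by dropna must match the count of NaN values.
assert len(toy) - len(cleaned) == n_missing
print(cleaned.shape)
```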

In [298]:
new_df['Demographics_Question'].value_counts()

Education         2942
Age               2274
Employment        2234
Marital status    2221
Residence         1516
Name: Demographics_Question, dtype: int64

We want to use education and age as feature columns and populate them with the corresponding responses, so first keep only the rows whose `Demographics_Question` is `Education` or `Age`.

In [299]:
features = ['Education', 'Age']
hasFeatures= new_df.Demographics_Question.isin(features)
subset_df=new_df[hasFeatures]
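One possible way to end up with one column per demographic response is `pivot_table`. The sketch below runs on invented rows, and the choice of index and column keys is an assumption about the intended layout, not something this notebook does.

```python
import pandas as pd

# Invented rows that mimic the cleaned frame's columns.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "A"],
    "Gender": ["F", "F", "F", "F"],
    "Question": ["q1"] * 4,
    "Demographics_Question": ["Education", "Education", "Age", "Age"],
    "Demographics_Response": ["Primary", "Higher", "15-24", "25-34"],
    "Value": [13.8, 10.1, 12.0, 11.5],
})

# Same filter as in the cell above.
subset = toy[toy["Demographics_Question"].isin(["Education", "Age"])]

# One row per (Country, Gender, Question), one column per response.
wide = subset.pivot_table(
    index=["Country", "Gender", "Question"],
    columns="Demographics_Response",
    values="Value",
    aggfunc="mean",
)
print(wide)
```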

Split the rows by `Demographics_Question`, starting with the `Education` rows.

In [300]:
is_education =  subset_df['Demographics_Question']=='Education'
hasEdu = subset_df[is_education]
hasEdu

Unnamed: 0,RecordID,Country,Gender,Demographics_Question,Demographics_Response,Question,Value
1,1,Afghanistan,F,Education,Higher,... if she burns the food,10.1
2,1,Afghanistan,F,Education,Secondary,... if she burns the food,13.7
3,1,Afghanistan,F,Education,Primary,... if she burns the food,13.8
13,1,Afghanistan,F,Education,No education,... if she burns the food,19.1
16,1,Afghanistan,M,Education,Higher,... if she burns the food,4.5
...,...,...,...,...,...,...,...
12547,280,Zimbabwe,M,Education,Secondary,... if she neglects the children,18.0
12548,350,Zimbabwe,M,Education,Higher,... if she refuses to have sex with him,2.9
12549,350,Zimbabwe,M,Education,No education,... if she refuses to have sex with him,16.2
12550,350,Zimbabwe,M,Education,Primary,... if she refuses to have sex with him,8.6


In [301]:
# Convert percentages to proportions in [0, 1]. assign() returns a new DataFrame,
# which avoids the SettingWithCopyWarning raised when assigning into a filtered slice.
hasEdu = hasEdu.assign(Value=hasEdu['Value'].div(100))


In [302]:
# Each Value is the share of respondents in one country/gender/question group who agree.
df_edu1 = hasEdu.query("Demographics_Response == 'No education'")
fig4 = px.histogram(df_edu1, x="Value", title="Distribution of the share of respondents with No Education who justify gender violence")
fig4.show()

In [303]:
df_edu2 = hasEdu.query("Demographics_Response == 'Primary'")
fig5 = px.histogram(df_edu2, x="Value", title="Distribution of the share of respondents with Primary Education who justify gender violence")
fig5.show()

In [304]:
df_edu3 = hasEdu.query("Demographics_Response == 'Secondary'")
fig6 = px.histogram(df_edu3, x="Value", title="Distribution of the share of respondents with Secondary Education who justify gender violence")
fig6.show()

In [305]:
df_edu4 = hasEdu.query("Demographics_Response == 'Higher'")
fig7 = px.histogram(df_edu4, x="Value", title="Distribution of the share of respondents with Higher Education who justify gender violence")
fig7.show()

All four histograms appear to follow a roughly exponential distribution.
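A rough way to probe this impression: for an exponential distribution the standard deviation equals the mean, so the coefficient of variation should be close to 1. The sketch below uses synthetic data as a stand-in (the `scale` and sample size are arbitrary assumptions); with the real data one would pass something like `df_edu1['Value'].to_numpy()` instead. Because `Value` is bounded by 1, an exponential shape can only ever be an approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a 'Value' column; not the survey data.
sample = rng.exponential(scale=0.25, size=10_000)

# Exponential data has std == mean, so this ratio should be near 1.
cv = sample.std(ddof=1) / sample.mean()
print(f"coefficient of variation: {cv:.3f}")
```

A ratio far from 1 on the real columns would be evidence against the exponential reading of the histograms.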

In [306]:
# Use the mean agreement share as the representative P(Agree) for each state
prob_noedu_yes = df_edu1['Value'].mean()
prob_noedu_yes

0.25403125000000015

In [307]:
prob_primaryedu_yes = df_edu2['Value'].mean()
prob_primaryedu_yes 

0.22819093406593413

In [308]:
prob_secondaryedu_yes = df_edu3['Value'].mean()
prob_secondaryedu_yes 

0.17378891820580483

In [309]:
prob_higheredu_yes = df_edu4['Value'].mean()
prob_higheredu_yes 

0.0889867021276596
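The four per-level means above could also be computed in one step with `groupby`. This is a sketch on invented values rather than on `hasEdu`, so the numbers it prints are not the ones shown in the cells above.

```python
import pandas as pd

# Invented proportions for two education levels.
toy = pd.DataFrame({
    "Demographics_Response": ["No education", "No education", "Higher", "Higher"],
    "Value": [0.30, 0.20, 0.10, 0.06],
})

# Mean agreement share per education level, indexed by the level name.
agree_prob = toy.groupby("Demographics_Response")["Value"].mean()
print(agree_prob)
```

On the real frame the equivalent call would be `hasEdu.groupby("Demographics_Response")["Value"].mean()`.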

Each state emits either `Agree` or `Disagree`, so the emission probability of `Disagree` is the complement of the agreement probability:

In [310]:
prob_noedu_no = 1 - prob_noedu_yes
prob_primaryedu_no = 1 - prob_primaryedu_yes 
prob_secondaryedu_no = 1 - prob_secondaryedu_yes
prob_higheredu_no = 1 - prob_higheredu_yes 

prob_higheredu_no

0.9110132978723404
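The same complements can be assembled into a single emission matrix, one row per education level. This sketch copies the mean values from the cell outputs above; the names `p_agree` and `emission` are introduced here for illustration.

```python
import numpy as np

# Mean agreement shares copied from the cell outputs above.
p_agree = np.array([
    0.25403125000000015,  # no education
    0.22819093406593413,  # primary
    0.17378891820580483,  # secondary
    0.0889867021276596,   # higher
])

# Column 0 = P(Agree | state), column 1 = P(Disagree | state).
emission = np.column_stack([p_agree, 1.0 - p_agree])

# Every row must be a probability distribution.
assert np.allclose(emission.sum(axis=1), 1.0)
print(emission)
```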

In [311]:
!pip install hmmlearn



In [312]:
# Human-readable description of the model; hmmlearn only needs the arrays built in the next cell.
states = ('no_education', 'primary_education', 'secondary_education', 'higher_education')

observations = ('Agree', 'Disagree')

# Uniform prior over the four education levels.
start_probability = {'no_education': 0.25, 'primary_education': 0.25,
                     'secondary_education': 0.25, 'higher_education': 0.25}

# Identity transitions: every state can only move to itself.
transition_probability = {
    'no_education': {'no_education': 1},
    'primary_education': {'primary_education': 1},
    'secondary_education': {'secondary_education': 1},
    'higher_education': {'higher_education': 1},
}

emission_probability = {
    'no_education': {'Agree': prob_noedu_yes, 'Disagree': prob_noedu_no},
    'primary_education': {'Agree': prob_primaryedu_yes, 'Disagree': prob_primaryedu_no},
    'secondary_education': {'Agree': prob_secondaryedu_yes, 'Disagree': prob_secondaryedu_no},
    'higher_education': {'Agree': prob_higheredu_yes, 'Disagree': prob_higheredu_no},
}
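With the identity transition matrix above, the hidden state never changes, so the model reduces to a mixture over education levels and the posterior after a single answer follows directly from Bayes' rule. A small sketch, using the mean agreement shares from the earlier cells rounded to eight decimals:

```python
import numpy as np

states = ["no_education", "primary_education",
          "secondary_education", "higher_education"]
prior = np.full(4, 0.25)  # uniform start probabilities
p_agree = np.array([0.25403125, 0.22819093, 0.17378892, 0.08898670])

# P(state | Agree) is proportional to P(state) * P(Agree | state).
unnormalised = prior * p_agree
posterior = unnormalised / unnormalised.sum()

for name, p in zip(states, posterior):
    print(f"{name}: {p:.3f}")
```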



In [313]:
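from hmmlearn import hmm
import numpy as np

# State order: 0 = no, 1 = primary, 2 = secondary, 3 = higher education.
# hmmlearn >= 0.2.8 provides CategoricalHMM for one discrete symbol per step;
# older releases used MultinomialHMM for the same purpose.
model = hmm.CategoricalHMM(n_components=4)
model.startprob_ = np.array([0.25, 0.25, 0.25, 0.25])
# Identity alternative (states never change):
# model.transmat_ = np.eye(4)
# Illustrative transition matrix; higher education (last row) is absorbing.
model.transmat_ = np.array([[0.1, 0.1, 0.1, 0.7],
                            [0.3, 0.4, 0.2, 0.1],
                            [0.3, 0.3, 0.2, 0.2],
                            [0, 0, 0, 1],
                            ])
# Columns: 0 = Agree, 1 = Disagree.
model.emissionprob_ = np.array([[prob_noedu_yes, prob_noedu_no],
                                [prob_primaryedu_yes, prob_primaryedu_no],
                                [prob_secondaryedu_yes, prob_secondaryedu_no],
                                [prob_higheredu_yes, prob_higheredu_no],
                                ])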
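Because the last row of this transition matrix sends higher education back to itself with probability 1, that state is absorbing and the state distribution drifts towards it over time. The sketch below propagates the uniform start distribution through the matrix with plain NumPy; the step counts are arbitrary choices for illustration.

```python
import numpy as np

start = np.array([0.25, 0.25, 0.25, 0.25])
transmat = np.array([
    [0.1, 0.1, 0.1, 0.7],
    [0.3, 0.4, 0.2, 0.1],
    [0.3, 0.3, 0.2, 0.2],
    [0.0, 0.0, 0.0, 1.0],
])

# Each row must sum to 1 to be a valid transition distribution.
assert np.allclose(transmat.sum(axis=1), 1.0)

# State distribution after n steps: start @ transmat^n.
for n in (1, 5, 50):
    dist = start @ np.linalg.matrix_power(transmat, n)
    print(n, dist.round(4))
```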


In [316]:
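# Observation sequence Agree (0) then Disagree (1), shaped (n_samples, 1).
logprob, seq = model.decode(np.array([[0, 1]]).transpose())
print(math.exp(logprob))  # probability of the most likely state path
print(seq)                # most likely state index at each step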

0.04049952319439829
[0 3]
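The decoded path can be checked by hand with a small Viterbi implementation. This is a NumPy-only sketch written for this check, not hmmlearn's own code; it works on raw probabilities instead of logs, which is fine for a two-step sequence but would underflow on long ones. The function name `viterbi` and its signature are introduced here for illustration.

```python
import numpy as np

def viterbi(start, trans, emit, obs):
    """Return (probability, path) of the most likely hidden state sequence."""
    n_obs, n_states = len(obs), len(start)
    delta = np.zeros((n_obs, n_states))          # best path probability ending in each state
    backptr = np.zeros((n_obs, n_states), dtype=int)
    delta[0] = start * emit[:, obs[0]]
    for t in range(1, n_obs):
        scores = delta[t - 1][:, None] * trans   # scores[i, j]: come from i, move to j
        backptr[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * emit[:, obs[t]]
    # Trace the best final state back to the start.
    path = [int(delta[-1].argmax())]
    for t in range(n_obs - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return float(delta[-1].max()), path[::-1]

start = np.array([0.25, 0.25, 0.25, 0.25])
trans = np.array([[0.1, 0.1, 0.1, 0.7],
                  [0.3, 0.4, 0.2, 0.1],
                  [0.3, 0.3, 0.2, 0.2],
                  [0.0, 0.0, 0.0, 1.0]])
p_agree = np.array([0.25403125000000015, 0.22819093406593413,
                    0.17378891820580483, 0.0889867021276596])
emit = np.column_stack([p_agree, 1.0 - p_agree])

prob, path = viterbi(start, trans, emit, [0, 1])  # Agree, then Disagree
print(prob, path)
```

The best path starts in state 0 (no education), the state most likely to emit `Agree`, and then takes the 0.7 transition into state 3 (higher education), the state most likely to emit `Disagree`: 0.25 × 0.254 × 0.7 × 0.911 ≈ 0.0405, which agrees with the `decode` output above.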
