## Final Project Submission: Classifying the Drivers of Venezuelan Migration in Colombia

* Student name: **Serena Quiroga**
* Student pace: **self paced**
* Scheduled project review date/time: **January 17, 2020**
* Instructor name: **Jeff Herman**
* Blog post URL:


# Project Overview

**Problem Statement:** For this project, I sourced household survey data from the National Administrative Department of Statistics of Colombia (DANE) to build a classification model that predicts the primary motivation for migrating among Venezuelan migrants in Colombia.

## Project Details:
- **Data Used:** Microdata was downloaded from the sites below and merged into a final dataset used for this project. Individual microdata csv files are available in this repo as well.
    - 2019 GEIH (*Gran Encuesta Integrada de Hogares* = Grand Integrated Household Survey): http://microdatos.dane.gov.co/index.php/catalog/599/study-description
    - 2019 Migration Module of GEIH: http://microdatos.dane.gov.co/index.php/catalog/641/related_materials
- **Packages Used:**
    - RandomForest
    - GridSearchCV

### Data Dictionary for Final Variables Used

## Project Approach: OSEMiN

For this project, I am using the OSEMiN approach: Obtain, Scrub, Explore, Model, iNterpret.
<img src="OSEMiN_approach.png">

**Obtain**: 
- Data was obtained from the National Administrative Department of Statistics of Colombia (DANE), specifically from the 2019 GEIH survey.
- Data was merged into one complete dataset.
- Column names were changed to more closely reflect the final variables used.

**Scrub**: 
- The data was filtered down to a subset of respondents: those who were living in Venezuela 12 months before the survey.

**Explore**:
- Initial exploration was conducted to understand how variables behave in relation to the target variable.
- Visualizations.

**Model**:
- A baseline classification model was produced
- Hyperparameter tuning
- GridSearch

**Interpret**:
- Summary of findings and resulting visualizations.
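
The Model step listed above (a baseline classifier followed by a grid search over hyperparameters) can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic (`make_classification`), the three-class target merely stands in for the migration-motive variable, and the grid values are placeholders rather than the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the survey features and a multi-class target (assumption)
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Baseline: a random forest with mostly default hyperparameters
baseline = RandomForestClassifier(n_estimators=50, random_state=42)
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# Small illustrative grid; the values are placeholders, not tuned choices
param_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=42),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)
tuned_acc = search.best_estimator_.score(X_test, y_test)

print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"best params: {search.best_params_}, tuned accuracy: {tuned_acc:.3f}")
```

Comparing `baseline_acc` with `tuned_acc` on the same held-out split is one simple way to judge whether the grid search helped.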

## Libraries Used

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.ensemble import ExtraTreesClassifier

import pyreadstat

In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Obtain

The datasets from DANE are split by survey module and by month. We will use the January 2019 `.sav` data files, retrieving each one and then merging on our respondents of interest: those who were recently living in Venezuela. Of the survey modules, we will use the Migration Module (MM), the General Characteristics/Demographics Module, and the Employment Effort Module, drawn from two of the three geographic sample designations: metro area and administrative head (cabecera) of each department surveyed.

**GEIH 2019**:
**Link to Migration Module (MM) Microdata library:**
http://microdatos.dane.gov.co/index.php/catalog/641/get_microdata
    - Módulo de Migración = Migration Module

**Link to non-MM Modules of the GEIH Survey Microdata library:** http://microdatos.dane.gov.co/index.php/catalog/599/get_microdata
    - Fuerza de Trabajo = Employment Effort
    - Caracteristicas Generales Personas = General Personal Characteristics

## Import sav files with metadata 

The individual `.sav` data files can be found in the Data folder of this repo. We will now use `pyreadstat` to read each `.sav` file along with its label encodings, which are stored in a metadata container and collected into a metadata dictionary.

In [3]:
# Area - General Personal Characteristics
area_cg_df, area_cg_meta = pyreadstat.read_sav('Data/Enero/Area_Caracteristicas_generales_Personas.sav')
area_cg_meta_dict = dict(zip(area_cg_meta.column_names, area_cg_meta.column_labels))

In [4]:
# Area - Employment Efforts
area_ft_df, area_ft_meta = pyreadstat.read_sav('Data/Enero/Area_Fuerza_de_trabajo.sav')
area_ft_meta_dict = dict(zip(area_ft_meta.column_names, area_ft_meta.column_labels))

In [5]:
# Admin Head - General Personal Characteristics
cb_cg_df, cb_cg_meta = pyreadstat.read_sav('Data/Enero/Cabecera_Caracteristicas_generales_Personas.sav')
cb_cg_meta_dict = dict(zip(cb_cg_meta.column_names, cb_cg_meta.column_labels))

In [6]:
# Admin Head - Employment Efforts
cb_ft_df, cb_ft_meta = pyreadstat.read_sav('Data/Enero/Cabecera_Fuerza_de_trabajo.sav')
cb_ft_meta_dict = dict(zip(cb_ft_meta.column_names, cb_ft_meta.column_labels))

In [7]:
# Migration Module
mm_df, mm_meta = pyreadstat.read_sav('Data/Enero/Enero_MM.sav')
mm_meta_dict = dict(zip(mm_meta.column_names, mm_meta.column_labels))

In [8]:
mm_meta_dict

{'Directorio': 'DIRECTORIO',
 'Secuencia_p': 'SECUENCIA_P',
 'Orden': 'ORDEN',
 'P6074': '¿…….. siempre ha vivido en este  municipio?',
 'P756': 'Dónde nació…..:',
 'P756S1': 'En otro Municipio: Departamento:____________________',
 'P756S3': 'En otro país:',
 'P755': '¿Dónde vivía …. , hace cinco años?',
 'P755S1': '¿Dónde vivía …. , hace cinco años? Departamento:',
 'P755S3': '¿Dónde vivía …. , hace cinco años?  En otro pais',
 'P754': 'El lugar donde vivía ……. hace cinco años era:',
 'P753': '¿Dónde vivía …. , hace 12 meses?',
 'P753S1': '¿Dónde vivía …. , hace 12 meses? Departamento:',
 'P753S3': '¿Dónde vivía …. , hace 12 meses? En otro pais',
 'P752': 'El lugar donde vivía ……. hace 12 meses era:',
 'P1662': '¿Cuál fue el principal motivo por el que …. Cambió el lugar donde residia hace 12 meses?',
 'Mes': 'Mes',
 'Fex_c_2011': 'Factor de expansión'}

In [9]:
df_list = [area_cg_df, area_ft_df, cb_cg_df, cb_ft_df, mm_df]
for df in df_list:
    df.shape
    df.head()

(29676, 37)

Unnamed: 0,DIRECTORIO,SECUENCIA_P,ORDEN,HOGAR,REGIS,P6016,P6020,P6030S1,P6030S3,P6040,...,P6175,P6210,P6210S1,P6220,P6269,AREA,ESC,MES,DPTO,fex_c_2011
0,4804535.0,1.0,1.0,1.0,10,1.0,2.0,5.0,1948.0,70.0,...,,3.0,1.0,,,5,1.0,1,5,930.626305
1,4804535.0,1.0,2.0,1.0,10,2.0,1.0,9.0,1980.0,38.0,...,,3.0,3.0,,,5,3.0,1,5,930.626305
2,4804535.0,1.0,3.0,1.0,10,3.0,2.0,8.0,1995.0,23.0,...,,5.0,11.0,2.0,,5,11.0,1,5,930.626305
3,4804536.0,1.0,1.0,1.0,10,1.0,2.0,8.0,1997.0,21.0,...,,5.0,11.0,2.0,,5,11.0,1,5,1097.211116
4,4804536.0,1.0,2.0,1.0,10,1.0,2.0,10.0,2014.0,4.0,...,1.0,2.0,0.0,,,5,0.0,1,5,1097.211116


(24955, 27)

Unnamed: 0,DIRECTORIO,SECUENCIA_P,ORDEN,HOGAR,REGIS,AREA,P6290,P6290S1,P6230,P6240,...,P6310S1,P6320,P6330,P6340,P6350,P6351,MES,FT,DPTO,fex_c_2011
0,4804535.0,1.0,1.0,1.0,50,5,,,1.0,4.0,...,,,,,,,1,1.0,5,930.626305
1,4804535.0,1.0,2.0,1.0,50,5,,,2.0,1.0,...,,,,,,,1,1.0,5,930.626305
2,4804535.0,1.0,3.0,1.0,50,5,,,3.0,1.0,...,,,,,,,1,1.0,5,930.626305
3,4804536.0,1.0,1.0,1.0,50,5,,,1.0,1.0,...,,,,,,,1,1.0,5,1097.211116
4,4804537.0,1.0,1.0,1.0,50,5,,,1.0,1.0,...,,,,,,,1,1.0,5,827.297496


(56644, 38)

Unnamed: 0,DIRECTORIO,SECUENCIA_P,ORDEN,HOGAR,REGIS,P6016,P6020,P6030S1,P6030S3,P6040,...,P6210,P6210S1,P6220,P6269,CLASE,ESC,MES,DPTO,fex_c_2011,AREA
0,4804535.0,1.0,1.0,1.0,10,1.0,2.0,5.0,1948.0,70.0,...,3.0,1.0,,,1,1.0,1,5,930.626305,5
1,4804535.0,1.0,2.0,1.0,10,2.0,1.0,9.0,1980.0,38.0,...,3.0,3.0,,,1,3.0,1,5,930.626305,5
2,4804535.0,1.0,3.0,1.0,10,3.0,2.0,8.0,1995.0,23.0,...,5.0,11.0,2.0,,1,11.0,1,5,930.626305,5
3,4804536.0,1.0,1.0,1.0,10,1.0,2.0,8.0,1997.0,21.0,...,5.0,11.0,2.0,,1,11.0,1,5,1097.211116,5
4,4804536.0,1.0,2.0,1.0,10,1.0,2.0,10.0,2014.0,4.0,...,2.0,0.0,,,1,0.0,1,5,1097.211116,5


(46815, 28)

Unnamed: 0,DIRECTORIO,SECUENCIA_P,ORDEN,HOGAR,REGIS,AREA,CLASE,P6290,P6290S1,P6230,...,P6310S1,P6320,P6330,P6340,P6350,P6351,MES,FT,DPTO,fex_c_2011
0,4804535.0,1.0,1.0,1.0,50,5,1,,,1.0,...,,,,,,,1,1.0,5,930.626305
1,4804535.0,1.0,2.0,1.0,50,5,1,,,2.0,...,,,,,,,1,1.0,5,930.626305
2,4804535.0,1.0,3.0,1.0,50,5,1,,,3.0,...,,,,,,,1,1.0,5,930.626305
3,4804536.0,1.0,1.0,1.0,50,5,1,,,1.0,...,,,,,,,1,1.0,5,1097.211116
4,4804537.0,1.0,1.0,1.0,50,5,1,,,1.0,...,,,,,,,1,1.0,5,827.297496


(62538, 18)

Unnamed: 0,Directorio,Secuencia_p,Orden,P6074,P756,P756S1,P756S3,P755,P755S1,P755S3,P754,P753,P753S1,P753S3,P752,P1662,Mes,Fex_c_2011
0,4804535.0,1.0,1.0,1.0,2.0,5.0,,2.0,,,1.0,2.0,,,1.0,,1,930.626305
1,4804535.0,1.0,2.0,1.0,1.0,,,2.0,,,1.0,2.0,,,1.0,,1,930.626305
2,4804535.0,1.0,3.0,1.0,1.0,,,2.0,,,1.0,2.0,,,1.0,,1,930.626305
3,4804536.0,1.0,1.0,2.0,2.0,23.0,,2.0,,,1.0,2.0,,,1.0,,1,1097.211116
4,4804536.0,1.0,2.0,1.0,1.0,,,1.0,,,,2.0,,,1.0,,1,1097.211116


Now let's combine the meta dictionaries into one so that we can get a better understanding of the variables.

In [10]:
full_meta_dict = dict(mm_meta_dict, **cb_cg_meta_dict)
full_meta_dict.update(cb_ft_meta_dict)
full_meta_dict.update(area_cg_meta_dict)
full_meta_dict.update(area_ft_meta_dict)

In [11]:
full_meta_dict

{'Directorio': 'DIRECTORIO',
 'Secuencia_p': 'SECUENCIA_P',
 'Orden': 'ORDEN',
 'P6074': '¿…….. siempre ha vivido en este  municipio?',
 'P756': 'Dónde nació…..:',
 'P756S1': 'En otro Municipio: Departamento:____________________',
 'P756S3': 'En otro país:',
 'P755': '¿Dónde vivía …. , hace cinco años?',
 'P755S1': '¿Dónde vivía …. , hace cinco años? Departamento:',
 'P755S3': '¿Dónde vivía …. , hace cinco años?  En otro pais',
 'P754': 'El lugar donde vivía ……. hace cinco años era:',
 'P753': '¿Dónde vivía …. , hace 12 meses?',
 'P753S1': '¿Dónde vivía …. , hace 12 meses? Departamento:',
 'P753S3': '¿Dónde vivía …. , hace 12 meses? En otro pais',
 'P752': 'El lugar donde vivía ……. hace 12 meses era:',
 'P1662': '¿Cuál fue el principal motivo por el que …. Cambió el lugar donde residia hace 12 meses?',
 'Mes': 'Mes',
 'Fex_c_2011': 'Factor de expansión',
 'DIRECTORIO': 'Directorio',
 'SECUENCIA_P': 'Secuencia_p',
 'ORDEN': 'Orden',
 'HOGAR': 'Hogar',
 'REGIS': 'Registro de la encuesta'
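
The combined dictionary above contains both `'Directorio'` and `'DIRECTORIO'` because keys are case-sensitive, and any key shared between modules keeps only the label from the last dictionary merged. A minimal sketch of a merge that also reports such overwrites is shown below. The helper `merge_label_dicts` and the toy dictionaries (including the `'Sexo (FT)'` label) are hypothetical and only stand in for the pyreadstat-derived dictionaries.

```python
def merge_label_dicts(*label_dicts):
    """Merge {column: label} dicts; later dicts win, differing labels are reported."""
    merged, conflicts = {}, {}
    for d in label_dicts:
        for col, label in d.items():
            if col in merged and merged[col] != label:
                # Remember every label seen for this column, in merge order
                conflicts.setdefault(col, [merged[col]]).append(label)
            merged[col] = label
    return merged, conflicts


# Toy dictionaries standing in for the real metadata dictionaries (assumption)
mm_labels = {"Directorio": "DIRECTORIO", "P1662": "Principal motivo"}
cg_labels = {"DIRECTORIO": "Directorio", "P6020": "Sexo"}
ft_labels = {"DIRECTORIO": "Directorio", "P6020": "Sexo (FT)"}

merged, conflicts = merge_label_dicts(mm_labels, cg_labels, ft_labels)
print(merged)
print(conflicts)
```

An empty `conflicts` result would indicate that the order of the `update` calls does not matter for the labels.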

# Scrub

## Merging Data

We start by concatenating the area and cabecera versions of each survey module, then merge the two resulting modules with each other. Finally, we left-merge the migration module subset for the respondents we are interested in onto that combined dataframe.

In [12]:
# Concatenate the General Personal Characteristics dataframes 
cg_frames = [area_cg_df, cb_cg_df]
cg_df = pd.concat(cg_frames, sort=True)
cg_df.shape

(86320, 38)

In [13]:
# Concatenate the Employment Efforts dataframes
ft_frames = [area_ft_df, cb_ft_df]
ft_df = pd.concat(ft_frames, sort=True)
ft_df.shape

(71770, 28)
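
The previews printed earlier show the same `DIRECTORIO`/`ORDEN` values at the top of both the area and cabecera files, so it may be worth checking how many people appear in both before relying on the concatenated frames. The sketch below uses toy data, and the choice of key columns is an assumption about what identifies a person in the survey.

```python
import pandas as pd

# Assumed person-level key (not confirmed by the survey documentation)
keys = ["DIRECTORIO", "SECUENCIA_P", "ORDEN", "HOGAR"]

# Toy frames: one household appears in both geographic files
area = pd.DataFrame({"DIRECTORIO": [1, 1, 2], "SECUENCIA_P": [1, 1, 1],
                     "ORDEN": [1, 2, 1], "HOGAR": [1, 1, 1], "P6020": [2, 1, 2]})
cabecera = pd.DataFrame({"DIRECTORIO": [1, 1, 3], "SECUENCIA_P": [1, 1, 1],
                         "ORDEN": [1, 2, 1], "HOGAR": [1, 1, 1], "P6020": [2, 1, 1],
                         "CLASE": [1, 1, 1]})

combined = pd.concat([area, cabecera], sort=True, ignore_index=True)

# Count rows whose key already appeared earlier in the combined frame
n_overlap = int(combined.duplicated(subset=keys).sum())

# Keep one row per person; keep="last" prefers the cabecera copy, which has CLASE
deduped = combined.drop_duplicates(subset=keys, keep="last")

print(n_overlap, len(combined), len(deduped))
```

Whether overlapping rows should be dropped depends on how DANE defines the two samples, so this check is diagnostic rather than prescriptive.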

In [14]:
# Merge the General Personal Characteristics and Employment Efforts dataframes
cg_ft = pd.merge(cg_df, ft_df, how="inner", on=['DIRECTORIO', 'SECUENCIA_P', 
                                                'ORDEN', 'AREA', 'CLASE', 'HOGAR', 'DPTO', 'MES'])
cg_ft.head()

Unnamed: 0,AREA,CLASE,DIRECTORIO,DPTO,ESC,HOGAR,MES,ORDEN,P6016,P6020,...,P6300,P6310,P6310S1,P6320,P6330,P6340,P6350,P6351,REGIS_y,fex_c_2011_y
0,5,,4804535.0,5,1.0,1.0,1,1.0,1.0,2.0,...,2.0,,,,,,,,50,930.626305
1,5,,4804535.0,5,3.0,1.0,1,2.0,2.0,1.0,...,,,,,,,,,50,930.626305
2,5,,4804535.0,5,11.0,1.0,1,3.0,3.0,2.0,...,,,,,,,,,50,930.626305
3,5,,4804536.0,5,11.0,1.0,1,1.0,1.0,2.0,...,,,,,,,,,50,1097.211116
4,5,,4804537.0,5,2.0,1.0,1,1.0,1.0,1.0,...,,,,,,,,,50,827.297496


In [15]:
cg_ft.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71770 entries, 0 to 71769
Data columns (total 58 columns):
AREA            71770 non-null object
CLASE           46815 non-null object
DIRECTORIO      71770 non-null float64
DPTO            71770 non-null object
ESC             71765 non-null float64
HOGAR           71770 non-null float64
MES             71770 non-null object
ORDEN           71770 non-null float64
P6016           71770 non-null float64
P6020           71770 non-null float64
P6030S1         71451 non-null float64
P6030S3         71451 non-null float64
P6040           71770 non-null float64
P6050           71770 non-null float64
P6070           71770 non-null float64
P6071           33348 non-null float64
P6071S1         31643 non-null float64
P6081           71770 non-null float64
P6081S1         11799 non-null float64
P6083           71770 non-null float64
P6083S1         22816 non-null float64
P6090           71770 non-null float64
P6100           66475 non-null float6

### Merge on Migration - Selecting Recent Venezuelan Residents

Before building the final dataframe from the survey modules, we first narrow down the Migration Module. The variable of interest is `P753S3`, which records the country where the respondent was living 12 months ago. It pairs with our target variable `P1662`, which asks for the principal reason the respondent left the place where they were living 12 months ago.

In order to merge the dataframes, we will need to get a better understanding of this column and its variable value labels so that we can make sure to subset for respondents who were living in Venezuela 12 months ago.

In [16]:
# examine value labels using the pyreadstat meta data container
mm_meta.variable_value_labels['P753S3']

{1.0: 'Estados Unidos',
 2.0: 'España',
 3.0: 'Venezuela',
 4.0: 'Ecuador',
 5.0: 'Panamá',
 6.0: 'Perú',
 7.0: 'Costa Rica',
 8.0: 'Argentina',
 9.0: 'Francia',
 10.0: 'Italia',
 11.0: 'Otro pais'}

Since we are only interested in respondents who were living in Venezuela 12 months before this survey, we will slice the `mm_df` to select these respondents. We will then perform a left merge on this dataframe with the other dataset.

In [17]:
# Make a dataframe containing respondents who were living in Venezuela 12 months ago
ven_mm = mm_df.loc[mm_df['P753S3'] == 3]
ven_mm.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 930 entries, 485 to 62493
Data columns (total 18 columns):
Directorio     930 non-null float64
Secuencia_p    930 non-null float64
Orden          930 non-null float64
P6074          930 non-null float64
P756           930 non-null float64
P756S1         930 non-null object
P756S3         839 non-null float64
P755           930 non-null float64
P755S1         930 non-null object
P755S3         829 non-null float64
P754           2 non-null float64
P753           930 non-null float64
P753S1         930 non-null object
P753S3         930 non-null float64
P752           0 non-null float64
P1662          930 non-null float64
Mes            930 non-null object
Fex_c_2011     930 non-null float64
dtypes: float64(14), object(4)
memory usage: 138.0+ KB
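
The filter above relies on knowing that code `3` means Venezuela. A small sketch of looking the code up from the value labels instead is shown below. The label dictionary is an abbreviated copy of the one printed above, while `toy_mm`, its values, and the `country_12m` column name are made up for illustration.

```python
import pandas as pd

# Abbreviated copy of the P753S3 value labels shown above
p753s3_labels = {1.0: "Estados Unidos", 2.0: "España", 3.0: "Venezuela", 4.0: "Ecuador"}

# Invert the mapping so a country name can be turned back into its survey code
code_for = {label: code for code, label in p753s3_labels.items()}

# Toy migration-module rows (values are illustrative only)
toy_mm = pd.DataFrame({"P753S3": [3.0, 1.0, None, 3.0, 4.0],
                       "P1662": [1.0, 2.0, None, 4.0, 1.0]})

# Select respondents by label rather than by a hard-coded number
ven_rows = toy_mm.loc[toy_mm["P753S3"] == code_for["Venezuela"]].copy()
ven_rows["country_12m"] = ven_rows["P753S3"].map(p753s3_labels)

print(ven_rows)
```

Taking `.copy()` after the `.loc` selection avoids pandas' chained-assignment warning when new columns are added to the subset.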


In [18]:
# Merge the Migration Module subset (respondents living in Venezuela 12 months ago) with the other merged dataframe
ven_df = pd.merge(ven_mm, cg_ft, how='left', 
                 left_on=['Directorio', 'Secuencia_p', 'Orden'], 
                 right_on=['DIRECTORIO', 'SECUENCIA_P', 'ORDEN'])
ven_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1304 entries, 0 to 1303
Data columns (total 76 columns):
Directorio      1304 non-null float64
Secuencia_p     1304 non-null float64
Orden           1304 non-null float64
P6074           1304 non-null float64
P756            1304 non-null float64
P756S1          1304 non-null object
P756S3          1165 non-null float64
P755            1304 non-null float64
P755S1          1304 non-null object
P755S3          1198 non-null float64
P754            4 non-null float64
P753            1304 non-null float64
P753S1          1304 non-null object
P753S3          1304 non-null float64
P752            0 non-null float64
P1662           1304 non-null float64
Mes             1304 non-null object
Fex_c_2011      1304 non-null float64
AREA            1037 non-null object
CLASE           663 non-null object
DIRECTORIO      1037 non-null float64
DPTO            1037 non-null object
ESC             1037 non-null float64
HOGAR           1037 non-null flo

Right away, we know that there are a few columns we can drop because these are essentially duplicate columns used for linking the various survey dataset modules.

In [19]:
# Drop columns that are basically duplicates
ven_df.drop(['DIRECTORIO', 'SECUENCIA_P', 'ORDEN', 'Fex_c_2011', 'fex_c_2011_y', 
             'fex_c_2011_x', 'REGIS_x', 'REGIS_y'], axis=1, inplace=True)

In [20]:
ven_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1304 entries, 0 to 1303
Data columns (total 68 columns):
Directorio     1304 non-null float64
Secuencia_p    1304 non-null float64
Orden          1304 non-null float64
P6074          1304 non-null float64
P756           1304 non-null float64
P756S1         1304 non-null object
P756S3         1165 non-null float64
P755           1304 non-null float64
P755S1         1304 non-null object
P755S3         1198 non-null float64
P754           4 non-null float64
P753           1304 non-null float64
P753S1         1304 non-null object
P753S3         1304 non-null float64
P752           0 non-null float64
P1662          1304 non-null float64
Mes            1304 non-null object
AREA           1037 non-null object
CLASE          663 non-null object
DPTO           1037 non-null object
ESC            1037 non-null float64
HOGAR          1037 non-null float64
MES            1037 non-null object
P6016          1037 non-null float64
P6020          1037 

We expected as many rows as in the `ven_mm` subset (930), but the merge returned 1,304. Let's check whether there are fully duplicated rows and remove them.

In [23]:
ven_dedup = ven_df.drop_duplicates(keep='first')
len(ven_dedup)

1304
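
Since `drop_duplicates` leaves the row count unchanged, the extra rows are not exact duplicates. A left merge grows when a key occurs more than once in the right-hand frame, and the `info()` output above also shows only 1,037 non-null `AREA` values out of 1,304 rows, so some respondents found no match at all. The sketch below, on toy data with a shortened key, shows one way to diagnose both effects using the `validate` and `indicator` options of `pd.merge`. The dedup rule at the end is an assumption, not the approach taken in this notebook.

```python
import pandas as pd

left = pd.DataFrame({"Directorio": [10, 11, 12], "Orden": [1, 1, 1],
                     "P1662": [1, 2, 1]})
right = pd.DataFrame({"DIRECTORIO": [10, 10, 11], "ORDEN": [1, 1, 1],
                      "CLASE": [None, 1, 1], "P6020": [2, 2, 1]})
left_keys, right_keys = ["Directorio", "Orden"], ["DIRECTORIO", "ORDEN"]

# Rows of the right frame whose key occurs more than once
dup_keys = right[right.duplicated(subset=right_keys, keep=False)]

# A plain left merge repeats a left row for every matching right row
inflated = pd.merge(left, right, how="left", left_on=left_keys,
                    right_on=right_keys, indicator=True)

# validate="many_to_one" raises MergeError when right keys are not unique
try:
    pd.merge(left, right, how="left", left_on=left_keys,
             right_on=right_keys, validate="many_to_one")
    merge_is_clean = True
except pd.errors.MergeError:
    merge_is_clean = False

# One possible fix (assumption): keep a single right row per key before merging
right_unique = right.drop_duplicates(subset=right_keys, keep="last")
clean = pd.merge(left, right_unique, how="left", left_on=left_keys,
                 right_on=right_keys, validate="many_to_one", indicator=True)

print(len(inflated), merge_is_clean, len(clean))
print(clean["_merge"].value_counts())
```

In the `_merge` column, `left_only` marks respondents with no counterpart in the right-hand frame, which is where the fully missing demographic columns come from.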

## Investigate NaNs

We now have 68 columns, which we still want to narrow down through a closer look at the available variables, especially those with null values.

Given that there are so many columns with missing values, let's create a way to see them in order of those with the most missing values.

In [26]:
percent_missing = ven_df.isnull().sum() * 100 / len(ven_df)
missing_value_df = pd.DataFrame({'columns_name': ven_df.columns, 
                                'percent_missing': percent_missing})
missing_value_df.sort_values('percent_missing', ascending=False, inplace=True)
missing_value_df.head(30)

Unnamed: 0,columns_name,percent_missing
P752,P752,100.0
P754,P754,99.693252
P6350,P6350,99.46319
P6330,P6330,99.309816
P6340,P6340,98.849693
P6320,P6320,98.159509
P6120,P6120,97.315951
P6110,P6110,96.395706
P6175,P6175,95.47546
P6269,P6269,95.015337


In [29]:
# make a list of variables with more than 600 NaNs (roughly 46% of 1,304 rows) in their column
vars_to_drop = []
cols = list(ven_df)

for col in cols:
    if ven_df[col].isnull().sum() > 600:
        vars_to_drop.append(col)

print(len(vars_to_drop))
vars_to_drop

26


['P754',
 'P752',
 'CLASE',
 'P6071',
 'P6071S1',
 'P6081S1',
 'P6083S1',
 'P6100',
 'P6110',
 'P6120',
 'P6150',
 'P6175',
 'P6220',
 'P6269',
 'P6250',
 'P6260',
 'P6270',
 'P6280',
 'P6290',
 'P6300',
 'P6310',
 'P6320',
 'P6330',
 'P6340',
 'P6350',
 'P6351']
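
The same selection can be expressed as a share of missing values rather than an absolute count, which keeps the rule meaningful if the number of rows changes. This is a minimal sketch on a toy frame. The `0.46` cutoff simply mirrors 600 / 1,304 and is not a recommended value.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "P1662": [1, 2, 1, 4],                  # no missing values
    "P6020": [1, 2, np.nan, 2],             # 25% missing
    "P752": [np.nan] * 4,                   # 100% missing
    "P6320": [np.nan, np.nan, np.nan, 1],   # 75% missing
})

max_missing = 0.46  # illustrative cutoff, close to 600 / 1304

# isnull().mean() gives the fraction of missing values per column
missing_share = toy.isnull().mean()
cols_to_drop = missing_share[missing_share > max_missing].index.tolist()
reduced = toy.drop(columns=cols_to_drop)

print(cols_to_drop)
print(reduced.columns.tolist())
```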

Most of our columns have null values. Let's take a quick look at the columns flagged above to see whether any can be removed immediately based on their relevance to our purposes.

In [30]:
# Print column name and column name description (from the survey meta data)
for k in full_meta_dict:
    if k in vars_to_drop:
        print(k, full_meta_dict[k])

P754 El lugar donde vivía ……. hace cinco años era:
P752 El lugar donde vivía ……. hace 12 meses era:
P6081S1 Número de orden de la persona
P6083S1 Número de orden de la persona
P6071 ¿El (la) cónyuge de  ... reside en este hogar?
P6071S1 Número de orden de la persona
P6150 ¿cuántos meses hace que ... No está afiliado o no cotiza a la seguridad social en salud?
P6100 ¿a cual de los siguientes regímenes de seguridad social en salud está afiliado:
P6110 ¿quién paga mensualmente por la afiliación de ...?
P6120 ¿cuánto paga o cuánto le descuentan mensualmente? (si no sabe cuánto paga o cuanto le descuentan escriba 98)
P6175 El establecimiento al que asiste ... ¿es oficial?
P6220 ¿cuál es el título o diploma de mayor nivel educativo que usted ha recibido?
P6269 ¿Se graduó usted de una escuela normal superior?
CLASE Clase
P6290 ¿qué hizo principalmente en las ultimas cuatro semanas ... Para conseguir un trabajo o instalar un negocio?
P6250 Además de lo anterior, ¿... Realizó la semana pasada a

In [32]:
# Drop the high-missingness columns collected in vars_to_drop
ven_df2 = ven_df.drop(columns=vars_to_drop)
ven_df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1304 entries, 0 to 1303
Data columns (total 42 columns):
Directorio     1304 non-null float64
Secuencia_p    1304 non-null float64
Orden          1304 non-null float64
P6074          1304 non-null float64
P756           1304 non-null float64
P756S1         1304 non-null object
P756S3         1165 non-null float64
P755           1304 non-null float64
P755S1         1304 non-null object
P755S3         1198 non-null float64
P753           1304 non-null float64
P753S1         1304 non-null object
P753S3         1304 non-null float64
P1662          1304 non-null float64
Mes            1304 non-null object
AREA           1037 non-null object
DPTO           1037 non-null object
ESC            1037 non-null float64
HOGAR          1037 non-null float64
MES            1037 non-null object
P6016          1037 non-null float64
P6020          1037 non-null float64
P6030S1        1035 non-null float64
P6030S3        1035 non-null float64
P6040       

In [33]:
# Make a list of column names with null values and print number of columns with nulls
null_cols = ven_df2.columns[ven_df2.isna().any()].tolist()
len(null_cols)

29

In [34]:
# Print column name and column name description (from the survey meta data)
for k in full_meta_dict:
    if k in null_cols:
        print(k, full_meta_dict[k])

P756S3 En otro país:
P755S3 ¿Dónde vivía …. , hace cinco años?  En otro pais
HOGAR Hogar
P6016 Número de orden de la persona que proporciona la información
P6020 Sexo
P6030S1 Mes (mm):
P6030S3 Año (aaaa):
P6040 ¿cuántos años cumplidos tiene...? (si es menor de 1 año, escriba 00)
P6050 ¿cuál es el parentesco de ... Con el jefe o jefa del hogar?
P6070 Actualmente:
P6081 ¿El padre de ... reside en este hogar?
P6083 ¿La madre de ... reside en este hogar?
P6090 ¿... Está afiliado, es cotizante o es beneficiario de alguna entidad de seguridad social en salud?
P6140 ¿anteriormente estuvo ... Afiliado, fue cotizante o beneficiario de alguna entidad de seguridad social en salud?
P6125 ¿en los últimos doce meses dejó de asistir al médico o no se hospitalizó, por no tener con que pagar estos servicios en la eps o ars?
P6160 ¿sabe leer y escribir?
P6170 ¿actualmente ... Asiste a la escuela, colegio o universidad?
P6210 ¿cuál es el nivel educativo más alto alcanzado por ... Y el último año o grado 

In [47]:
ven_df2['P6020'].map(area_cg_meta.variable_value_labels['P6020']).value_counts(normalize=True)

Mujer     0.529412
Hombre    0.470588
Name: P6020, dtype: float64

In [48]:
ven_df2['P6020'].isnull().sum()

267

In [50]:
ven_df2.tail()

Unnamed: 0,Directorio,Secuencia_p,Orden,P6074,P756,P756S1,P756S3,P755,P755S1,P755S3,...,P6160,P6170,P6210,P6210S1,FT,P6230,P6240,P6240S1,P6290S1,P6310S1
1299,4829558.0,1.0,1.0,2.0,3.0,,3.0,4.0,,3.0,...,1.0,2.0,5.0,11.0,1.0,1.0,1.0,,,
1300,4829858.0,1.0,4.0,2.0,3.0,,3.0,4.0,,3.0,...,1.0,2.0,3.0,3.0,1.0,4.0,4.0,,,
1301,4829858.0,1.0,5.0,2.0,3.0,,3.0,4.0,,3.0,...,1.0,2.0,4.0,7.0,1.0,5.0,4.0,,,
1302,4829860.0,1.0,2.0,2.0,3.0,,3.0,4.0,,3.0,...,1.0,2.0,6.0,1.0,1.0,2.0,4.0,,,
1303,4829860.0,1.0,2.0,2.0,3.0,,3.0,4.0,,3.0,...,1.0,2.0,6.0,1.0,1.0,2.0,4.0,,,


In [51]:
ven_df2.isnull().sum(axis=1)

0        0
1        0
2       27
3        2
4        2
        ..
1299     0
1300     0
1301     0
1302     1
1303     1
Length: 1304, dtype: int64

## Relabel

In [43]:
# Our Target Variable
mm_df['P1662'].map(mm_meta.variable_value_labels['P1662']).value_counts(normalize=True)

Acompañar a otros miembros del  hogar                           0.468085
Trabajo                                                         0.288754
Amenaza o riesgo para su vida o integridad física por violen    0.118541
Conformación de un nuevo hogar                                  0.043313
Estudio                                                         0.031155
Amenaza o riesgo para su vida o integridad física ocasionada    0.029635
Salud                                                           0.018997
Motivos culturales asociados a grupos étnicos                   0.001520
Name: P1662, dtype: float64
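
The distribution above is heavily imbalanced, with some motives covering well under 5% of responses. One possible relabeling step, sketched below, maps codes to names and folds rare categories into a single group before modeling. The code-to-label mapping, the toy series, the `0.15` threshold, and the `"Otro"` bucket are all hypothetical and chosen only to keep the example small.

```python
import pandas as pd

# Hypothetical codes; the real mapping lives in mm_meta.variable_value_labels['P1662']
labels = {1.0: "Trabajo", 2.0: "Estudio", 3.0: "Salud"}

target = pd.Series([1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 1.0, 2.0, 1.0, 1.0], name="P1662")
named = target.map(labels)

# Share of each category, then the categories below the illustrative threshold
share = named.value_counts(normalize=True)
rare = share[share < 0.15].index

# Replace rare categories with a single catch-all label
collapsed = named.where(~named.isin(rare), "Otro")

print(collapsed.value_counts())
```

Collapsing classes changes what the model predicts, so any grouping like this would need to be justified by the research question rather than by class counts alone.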

# Explore


## Overview of Survey Dataset Modules:

The following are the most important variables of the statistical operation:

- Housing: Type of housing and physical characteristics (material of walls and floors).
- Household data: Connection to public, private, and communal services, value paid for consumption and quality thereof, connection and use of sanitary service, obtaining water for consumption, place and energy to prepare food, disposal of garbage, type of home ownership, home ownership and cell phone ownership.
- Registration of persons: Identification of the habitual resident.
- General characteristics: Sex, age, kinship, marital status.
- Social security in health: Coverage of the General System of Social Security in Health (SGSSS) by regimes, person who pays membership and coverage.
- Education: Literacy, school attendance, maximum educational level reached and last year approved or in progress, and degrees or diplomas obtained.
- Workforce: PEA (employed and unemployed).
- Employed: Main employment (branch of activity, occupation, type of contract, access to benefits, time worked and occupational position).
- Employees (means of searching, monthly remuneration, overtime, payments in kind, subsidies, premiums and bonuses).
- Independent (forms of work, commercial register, accounting, profit or net fees).
- Salaried and independent employees (duration of employment, normal and effective hours worked, fees, company size, workplace, pension affiliation, family compensation fund and ARP, duration between previous and current employment).
- Secondary employment (hours worked, occupational position, monthly remuneration, company size and workplace).
- Companies with inadequate hours and situations of inappropriate employment (by skills and income).
- Quality of employment.
- Unemployed: Duration of job search, work history, income and social security.
- Inactive: Work history, income and social security.
- Fertility.
- Other activities.
- Non-labor income.

# Model

# Interpret

## Summary of Findings:

## Considerations for Further Analysis

- **Labeled data on propensity for return** could help policymakers decide which short-, medium-, and long-term interventions are needed, based on a better understanding of the factors behind, and the likelihood of, return for migrants in Colombia.
- **Analysis of how this classification performs when integrating data on Venezuelan migrants in other countries** (Ecuador, Peru, Chile, Brazil, Spain, Portugal, and the U.S.A.).