# **Victims of armed conflict**

### Names of the members:

*  Sofia Gomez Rodriguez - 2242086
*  Victor Andres Martinez Preciado - 2240805
*  Juan Esteban Paredes Alvarez - 2240567
*  Sofia Reyes Molina - 2240094
*  Salome Rivas Marulanda - 2242055



# **General Description of the Dataset:**

The dataset contains information about victims of the armed conflict reported in Cali, recorded from June 5, 2012, to November 30, 2023.


## **Description of the Columns:**
**code_sspm:** Sequential number identifying each row.

**date_processing:** Date on which the form or document was completed.

**code_municipality:** Numeric code identifying the municipality (for example, 76001 for Cali).

**municipality:** Name of the municipality where the victimizing event was reported.

**sex:** Sex recorded on the person's identification document (male or female).

**ethnic_group:** This column indicates the ethnic group to which the victim belongs, such as Indigenous, Afro-Colombian, Mestizo, etc.

**victimization_fact:** Refers to violations of International Humanitarian Law (IHL) and Human Rights (HR) that occurred within the framework of Article 3 of Law 1448.

**web_report_date:** Date on which the report was submitted through the web platform.

**departament:** This column represents the department where the victims of the armed conflict are recorded. In this dataset, it only includes the department of Valle del Cauca.

**date_of_birth:** This column shows the victim's date of birth, including day, month, and year.

**years_in_the_visitia:** This column indicates the age of the person at the time they were registered as a victim of the armed conflict.

**commune:** This column represents the district (comuna) where the victim lived and where the armed conflict event occurred.

**cut-off_date:** Refers to the last date on which each record was updated; it represents the cutoff date up to which the data was collected.

# **Extraction Phase**

## Link to where the dataset was taken from

https://datos.cali.gov.co/tl/dataset/poblacion-victima-del-conflicto-armado/resource/2751d517-53f9-4abb-9046-e69af95676d3

In [1]:
# Import the necessary libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [2]:
# Connect to GitHub to load the data
!git clone https://github.com/vamphook972/armed-conflict-ETL-project.git
%cd armed-conflict-ETL-project/

Cloning into 'armed-conflict-ETL-project'...
remote: Enumerating objects: 22, done.
remote: Counting objects: 100% (22/22), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 22 (delta 5), reused 15 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (22/22), 822.87 KiB | 2.18 MiB/s, done.
Resolving deltas: 100% (5/5), done.
/content/armed-conflict-ETL-project


In [3]:
# Load the raw dataset (semicolon-separated, Latin-1 encoded, read without a header row)
df_victims1 = pd.read_csv('./data/raw/data-population-victims-of-armed-conflict.csv', encoding='latin-1', header=None, sep=';')



# **Exploration and cleanup phase**

In [4]:
# Preview the first two rows to see the column names
df_victims1.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,codigo_sspm,FechaDiligenciamiento,CodMunicipio,Municipio,Departamento,sexo,Grupo étnico,fechadenacimiento,Edad años en la visita,Comuna,HechoVictimizante,fechadecorte,fechareporteweb
1,1,05-jun-2012,76001,Cali,Valle del Cauca,Masculino,Mestizo,16-jun-1964,47,8,Desplazamiento forzado,30-nov-2023,4-dic-2023


In [5]:
# Check the size of the DataFrame
df_victims1.shape

(72128, 13)

In [6]:
# Use .info() to inspect column dtypes and non-null counts
df_victims1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72128 entries, 0 to 72127
Data columns (total 13 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       71151 non-null  object
 1   1       71151 non-null  object
 2   2       71151 non-null  object
 3   3       71151 non-null  object
 4   4       71151 non-null  object
 5   5       71151 non-null  object
 6   6       71151 non-null  object
 7   7       71151 non-null  object
 8   8       71151 non-null  object
 9   9       71144 non-null  object
 10  10      64794 non-null  object
 11  11      71151 non-null  object
 12  12      71151 non-null  object
dtypes: object(13)
memory usage: 7.2+ MB


In [7]:
# The column labels were just consecutive numbers and carried no meaning,
# so we rename them with normalized snake_case names
headers = ["code_sspm","date_processing","code_municipality","municipality","departament", "sex","ethnic_group",
         "date_of_birth","years_in_the_visitia","commune", "victimization_fact","cut-off_date","web_report_date"]

df_victims1.columns = headers

In [8]:
# Drop the first row, which repeats the original header, and reset the index
df = df_victims1.drop([0]).reset_index(drop=True)
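The rename-and-drop steps above could also be handled at read time. The following is a minimal sketch, not the approach used in this notebook: it assumes the first line of the file is always the Spanish header row, and it uses a small in-memory sample (hypothetical, with only three of the 13 columns) instead of the real CSV.

```python
import io

import pandas as pd

# Hypothetical sample that mimics the raw file layout: ';' separator,
# original Spanish header on the first line. Only three columns are shown.
raw = io.StringIO(
    "codigo_sspm;Municipio;Comuna\n"
    "1;Cali;8\n"
    "2;Cali;\n"
)

headers = ["code_sspm", "municipality", "commune"]

# header=0 consumes the original header row and names= replaces it.
# dtype=str keeps every column as text, so no column mixes strings and floats.
sample = pd.read_csv(raw, sep=";", header=0, names=headers, dtype=str)

print(sample)
```

Reading everything as text defers type conversion to an explicit cleaning step, which may be easier to audit than letting the parser infer types chunk by chunk.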

In [9]:
df.head()

Unnamed: 0,code_sspm,date_processing,code_municipality,municipality,departament,sex,ethnic_group,date_of_birth,years_in_the_visitia,commune,victimization_fact,cut-off_date,web_report_date
0,1,05-jun-2012,76001,Cali,Valle del Cauca,Masculino,Mestizo,16-jun-1964,47,8,Desplazamiento forzado,30-nov-2023,4-dic-2023
1,2,05-jun-2012,76001,Cali,Valle del Cauca,Femenino,Mestizo,27-oct-1970,41,8,Desplazamiento forzado,30-nov-2023,4-dic-2023
2,3,05-jun-2012,76001,Cali,Valle del Cauca,Masculino,Mestizo,04-oct-1996,15,8,Desplazamiento forzado,30-nov-2023,4-dic-2023
3,4,05-jun-2012,76001,Cali,Valle del Cauca,Femenino,Mestizo,12-jun-2007,4,8,Desplazamiento forzado,30-nov-2023,4-dic-2023
4,5,25-jun-2012,76001,Cali,Valle del Cauca,Femenino,Afrodescendiente,20-may-1970,42,8,Desplazamiento forzado,30-nov-2023,4-dic-2023


In [10]:
# .describe() returns summary metrics (count, unique, top, freq)
# Note: df_victims1 still contains the original header row, so counts here are one higher than in df
df_victims1.describe()

Unnamed: 0,code_sspm,date_processing,code_municipality,municipality,departament,sex,ethnic_group,date_of_birth,years_in_the_visitia,commune,victimization_fact,cut-off_date,web_report_date
count,71151.0,71151,71151,71151,71151,71151,71151,71151,71151,71144,64794,71151,71151
unique,71151.0,2593,3,2,2,3,11,23691,211,70,13,2,2
top,71150.0,16-nov-2018,76001,Cali,Valle del Cauca,Femenino,Mestizo,31-dic-1938,10,15,Desplazamiento forzado,30-nov-2023,4-dic-2023
freq,1.0,403,65535,71150,71150,41080,24680,23,1775,12815,60499,71150,71150


# Duplicate record analysis

In [11]:
# Count fully duplicated rows
df.duplicated().sum()

np.int64(976)

In [12]:
# Check for duplicate IDs
df['code_sspm'].duplicated().sum()

np.int64(976)
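Both counts are 976, and the null check in the next cell reports 977 missing values in most columns. One plausible reading, inferred from these counts rather than verified against the file, is that the duplicates are entirely empty rows: 977 empty rows would yield 976 duplicates of the first one. The sketch below shows how this could be checked, using a small hypothetical DataFrame in place of `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: two real records plus three fully empty rows
toy = pd.DataFrame({
    "code_sspm": ["1", "2", np.nan, np.nan, np.nan],
    "commune": ["8", "15", np.nan, np.nan, np.nan],
})

# Rows where every column is missing
n_empty = toy.isna().all(axis=1).sum()

# duplicated() treats NaN as equal to NaN, so empty rows duplicate each other
n_dupes = toy.duplicated().sum()

# Duplicates that remain once the fully empty rows are removed
n_dupes_nonempty = toy.dropna(how="all").duplicated().sum()

print(n_empty, n_dupes, n_dupes_nonempty)
```

If the same pattern holds on the real data, `dropna(how="all")` would remove the empty rows and the duplicate count would fall to zero.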

In [13]:
# Count null values in each column of the df
df.isna().sum()

Unnamed: 0,0
code_sspm,977
date_processing,977
code_municipality,977
municipality,977
departament,977
sex,977
ethnic_group,977
date_of_birth,977
years_in_the_visitia,977
commune,984


In [14]:
# Percentage of nulls in the victimization_fact and commune columns
percentage_nulls_fact = df["victimization_fact"].isna().sum() / df.shape[0] * 100
print(f'The percentage of null values in the victimization_fact column is: {percentage_nulls_fact: .2f}%')

percentage_nulls_commune = df["commune"].isna().sum() / df.shape[0] * 100
print(f'The percentage of null values in the commune column is: {percentage_nulls_commune: .4f}%')

The percentage of null values in the victimization_fact column is:  10.17%
The percentage of null values in the commune column is:  1.3643%
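The same percentage can be computed for every column at once, since the mean of a boolean mask is the fraction of `True` values. This is a minimal sketch on a hypothetical four-row DataFrame, not on the real data.

```python
import pandas as pd

# Hypothetical stand-in for df with some missing values
toy = pd.DataFrame({
    "commune": ["8", None, "15", "3"],
    "victimization_fact": [None, None, "Desplazamiento forzado", "Desplazamiento forzado"],
})

# isna() gives a boolean mask; its column-wise mean is the share of nulls
null_pct = toy.isna().mean().mul(100).round(2).sort_values(ascending=False)

print(null_pct)
```

Note that the denominator is the total number of rows, so any fully empty rows still present in the DataFrame raise the percentage of every column.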


In [15]:
# Fill missing values in victimization_fact and commune with the label 'No Registra'
df['victimization_fact'] = (df['victimization_fact']).fillna('No Registra')
df['commune'] = (df['commune']).fillna('No Registra')

In [16]:
# Check the data type of every column
df.dtypes

Unnamed: 0,0
code_sspm,object
date_processing,object
code_municipality,object
municipality,object
departament,object
sex,object
ethnic_group,object
date_of_birth,object
years_in_the_visitia,object
commune,object
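Every column is stored as `object`, including the date columns, whose values use Spanish month abbreviations (for example `05-jun-2012` and `4-dic-2023`). The sketch below shows one possible way to convert such strings to datetimes. The helper `parse_spanish_dates` and the `SPANISH_MONTHS` mapping are hypothetical names introduced for illustration, and the sketch assumes every date follows the `day-mon-year` pattern seen in the preview rows.

```python
import pandas as pd

# Assumed mapping of Spanish three-letter month abbreviations to month numbers
SPANISH_MONTHS = {
    "ene": "01", "feb": "02", "mar": "03", "abr": "04",
    "may": "05", "jun": "06", "jul": "07", "ago": "08",
    "sep": "09", "oct": "10", "nov": "11", "dic": "12",
}


def parse_spanish_dates(series: pd.Series) -> pd.Series:
    """Convert 'dd-mon-yyyy' strings with Spanish months to datetimes."""
    parts = series.str.lower().str.extract(r"^(\d{1,2})-([a-z]{3})-(\d{4})$")
    parts.columns = ["day", "month", "year"]
    parts["month"] = parts["month"].map(SPANISH_MONTHS)
    # str.cat propagates missing values, so unparseable rows stay missing
    iso = parts["year"].str.cat([parts["month"], parts["day"].str.zfill(2)], sep="-")
    return pd.to_datetime(iso, format="%Y-%m-%d", errors="coerce")


# Hypothetical sample with one missing value
dates = pd.Series(["05-jun-2012", "4-dic-2023", None])
parsed = parse_spanish_dates(dates)

print(parsed)
```

Values that do not match the expected pattern come back as `NaT`, which makes them easy to count and inspect before deciding how to handle them.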


In [17]:
# Keep a reference to the age column to work with it separately
age = df['years_in_the_visitia']

In [18]:
df['years_in_the_visitia'].unique()

array(['47', '41', '15', '4', '42', '27', '0', '81', '59', '16', '22',
       '3', '40', '17', '14', '34', '8', '35', '6', '57', '28', '25',
       '66', '87', '48', '10', '71', '36', '39', '19', '13', '5', '37',
       '11', '9', '77', '51', '55', '29', '49', '18', '56', '20', '12',
       '76', '33', '50', '21', '7', '31', '2', '67', '64', '44', '-8',
       '73', '30', '52', '1', '45', '38', '-1', '75', '46', '43', '24',
       '61', '32', '23', '62', '26', '68', '60', '79', '82', '89', '74',
       '80', '54', '63', '85', '65', '53', '72', '70', '86', '69', '58',
       '84', '88', '91', '-4', '100', '98', '78', '-6', '-3', '83', '93',
       '90', '-2', '94', '92', '95', '96', '-5', '-9', '116', '99', '102',
       '115', '-12', '105', '97', '-90', 16.0, 11.0, 10.0, 67.0, 52.0,
       42.0, 22.0, 18.0, 40.0, 32.0, 5.0, 49.0, 9.0, 36.0, 29.0, 2.0,
       56.0, 59.0, 54.0, 23.0, 53.0, 34.0, 63.0, 73.0, 17.0, 12.0, 6.0,
       19.0, 7.0, 44.0, 37.0, 8.0, 14.0, 24.0, 61.0, 15.0, 33.0,