# Week-05 Submission

## Data Description

For this week's submission, I will be using the dataset "Fatalities in the Israeli-Palestinian". This dataset contains information about fatalities that occurred during the conflict between Israel and Palestine in the period 2000-2023.

The dataset is sourced from Kaggle: [Go to source](https://www.kaggle.com/datasets/willianoliveiragibin/fatalities-in-the-israeli-palestinian)



In [1]:
import pandas as pd

from sklearn.preprocessing import LabelEncoder

## Data Preparation


In [2]:

df = pd.read_csv("./datasets/fatalities.csv")
df.head()

Unnamed: 0,name,date_of_event,age,citizenship,event_location,event_location_district,event_location_region,date_of_death,gender,took_part_in_the_hostilities,place_of_residence,place_of_residence_district,type_of_injury,ammunition,killed_by,notes
0,'Abd a-Rahman Suleiman Muhammad Abu Daghash,2023-09-24,32.0,Palestinian,Nur Shams R.C.,Tulkarm,West Bank,2023-09-24,M,,Nur Shams R.C.,Tulkarm,gunfire,live ammunition,Israeli security forces,Fatally shot by Israeli forces while standing ...
1,Usayed Farhan Muhammad 'Ali Abu 'Ali,2023-09-24,21.0,Palestinian,Nur Shams R.C.,Tulkarm,West Bank,2023-09-24,M,,Nur Shams R.C.,Tulkarm,gunfire,live ammunition,Israeli security forces,Fatally shot by Israeli forces while trying to...
2,'Abdallah 'Imad Sa'ed Abu Hassan,2023-09-22,16.0,Palestinian,Kfar Dan,Jenin,West Bank,2023-09-22,M,,al-Yamun,Jenin,gunfire,live ammunition,Israeli security forces,Fatally shot by soldiers while firing at them ...
3,Durgham Muhammad Yihya al-Akhras,2023-09-20,19.0,Palestinian,'Aqbat Jaber R.C.,Jericho,West Bank,2023-09-20,M,,'Aqbat Jaber R.C.,Jericho,gunfire,live ammunition,Israeli security forces,Shot in the head by Israeli forces while throw...
4,Raafat 'Omar Ahmad Khamaisah,2023-09-19,15.0,Palestinian,Jenin R.C.,Jenin,West Bank,2023-09-19,M,,Jenin,Jenin,gunfire,live ammunition,Israeli security forces,Wounded by soldiers’ gunfire after running awa...


## Size of the data

This dataset contains 11,124 rows and 16 columns.

In [3]:
df.shape

(11124, 16)

## Count Missing Values in the Dataset

In [4]:
df.isnull().sum()

name                               0
date_of_event                      0
age                              129
citizenship                        0
event_location                     0
event_location_district            0
event_location_region              0
date_of_death                      0
gender                            20
took_part_in_the_hostilities    1430
place_of_residence                68
place_of_residence_district       68
type_of_injury                   291
ammunition                      5253
killed_by                          0
notes                            280
dtype: int64

## Change the data type and fill the missing values

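The data type of the column "age" is float64, and 129 of its values are missing. I will first fill the missing values with the mean of the column, then change the data type to integer. The order matters because a column that still contains NaN cannot be cast to int.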

In [5]:
# Assign the result back; fillna(inplace=True) on a column selection is deprecated in recent pandas
df["age"] = df["age"].fillna(df["age"].mean())
df["age"] = df["age"].astype(int)

df["age"].info()
print(f"The missing values of age now is {df['age'].isnull().sum()}")

<class 'pandas.core.series.Series'>
RangeIndex: 11124 entries, 0 to 11123
Series name: age
Non-Null Count  Dtype
--------------  -----
11124 non-null  int64
dtypes: int64(1)
memory usage: 87.0 KB
The missing values of age now is 0
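
Casting the mean-filled column with `astype(int)` drops the fractional part of the mean. As a minimal sketch on made-up toy values (the series `ages` below is an assumption for illustration, not the real data), two possible alternatives are rounding the mean before filling, or keeping the gaps and using pandas' nullable `Int64` type:

```python
import pandas as pd

# Toy stand-in for the "age" column; the values are invented for illustration.
ages = pd.Series([32.0, 21.0, None, 16.0, 19.0], name="age")

# Option A: fill gaps with the rounded mean, then cast to a plain integer type.
mean_filled = ages.fillna(round(ages.mean())).astype(int)

# Option B: leave the gaps as missing and use the nullable integer dtype.
nullable = ages.astype("Int64")

print(mean_filled.tolist())
print(nullable.isna().sum())
```

Option B avoids inventing ages at all, at the cost of having to handle missing values later in the analysis.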


## Remove rows with missing values in the column "gender"

Only 20 of the 11,124 rows have a missing value in the "gender" column, which is relatively small, so I will simply delete them.

In [6]:
df.dropna(subset="gender", inplace=True)

df["gender"].isnull().sum()

0

## Fill the missing values in the column "took_part_in_the_hostilities"

The missing values in the column "took_part_in_the_hostilities" are filled with the mode of the column. This means that the missing values will be filled with the most frequent value in the column.

In [7]:
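# Assign the result back rather than using fillna(inplace=True) on a column selection
df["took_part_in_the_hostilities"] = df["took_part_in_the_hostilities"].fillna(df["took_part_in_the_hostilities"].mode()[0])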

print("The missing values of took_part_in_the_hostilities now is {}".format(df["took_part_in_the_hostilities"].isnull().sum()))

The missing values of took_part_in_the_hostilities now is 0


In [8]:
df["took_part_in_the_hostilities"]

0              No
1              No
2              No
3              No
4              No
           ...   
11119    Israelis
11120     Unknown
11121    Israelis
11122          No
11123    Israelis
Name: took_part_in_the_hostilities, Length: 11104, dtype: object

As we can see, there are no missing values left in this column. However, the column "took_part_in_the_hostilities" contains the value 'Unknown', which carries no more information than a missing value.
Let's see how many rows have this value.

In [9]:
df["took_part_in_the_hostilities"].str.contains("Unknown").sum()

587

There are 587 rows with the value 'Unknown' in the column "took_part_in_the_hostilities". I will replace this value with the mode of the column.

In [10]:
unknown_value = df["took_part_in_the_hostilities"].str.contains("Unknown")

df.loc[unknown_value, "took_part_in_the_hostilities"] = df["took_part_in_the_hostilities"].mode()[0]
df["took_part_in_the_hostilities"].str.contains("Unknown").sum()

0
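
The two steps above (filling missing values, then replacing 'Unknown') can also be sketched as a single pass in which the mode is computed only from the known values, so that neither NaN nor 'Unknown' can become the fill value. This is a rough illustration on a made-up series `s`, not the real column:

```python
import pandas as pd

# Toy stand-in for "took_part_in_the_hostilities"; the values are invented.
s = pd.Series(["No", "Yes", "Unknown", "No", None, "Unknown"])

# Treat both NaN and the literal "Unknown" as unusable entries.
is_known = s.notna() & s.ne("Unknown")

# Most frequent value among the known entries only.
most_common = s[is_known].mode()[0]

# Keep known entries, replace everything else with the mode.
cleaned = s.where(is_known, most_common)

print(cleaned.tolist())
```

Using an exact comparison (`ne`) instead of `str.contains` also avoids matching longer strings that merely include the word "Unknown".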

## Drop the columns "ammunition" and "notes"

The 'ammunition' column is missing in 5,253 of the 11,124 rows (about 47% of the data), and the 'notes' column contains free text. Since neither column is essential for the current analysis, I have decided to remove both of them.

In [11]:
# Drop both columns in one call; axis=1 is redundant when columns= is given
df.drop(columns=["ammunition", "notes"], inplace=True)

df.columns

Index(['name', 'date_of_event', 'age', 'citizenship', 'event_location',
       'event_location_district', 'event_location_region', 'date_of_death',
       'gender', 'took_part_in_the_hostilities', 'place_of_residence',
       'place_of_residence_district', 'type_of_injury', 'killed_by'],
      dtype='object')
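
The decision to drop a column because of its missing values can be made explicit by computing the share of missing entries per column. The following is a minimal sketch on a made-up frame `toy`; the 0.4 threshold is an assumption chosen for illustration, not a rule taken from this assignment:

```python
import pandas as pd

# Toy frame; column names mirror the dataset but the values are invented.
toy = pd.DataFrame({
    "age": [32, 21, None, 19],
    "ammunition": [None, "live ammunition", None, None],
    "killed_by": ["a", "b", "c", "d"],
})

# isna() gives booleans, so the column mean is the fraction of missing values.
missing_share = toy.isna().mean()

# Hypothetical cut-off: drop columns with more than 40% missing entries.
to_drop = missing_share[missing_share > 0.4].index.tolist()
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(list(reduced.columns))
```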

## Fill the missing values in the columns "place_of_residence", "place_of_residence_district", and "type_of_injury"

I will fill the missing values in these columns with the mode of the column. 

In [12]:
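# Assign each filled column back rather than using fillna(inplace=True) on a column selection
df["place_of_residence"] = df["place_of_residence"].fillna(df["place_of_residence"].mode()[0])
df["place_of_residence_district"] = df["place_of_residence_district"].fillna(df["place_of_residence_district"].mode()[0])
df["type_of_injury"] = df["type_of_injury"].fillna(df["type_of_injury"].mode()[0])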

df.isnull().sum()

name                            0
date_of_event                   0
age                             0
citizenship                     0
event_location                  0
event_location_district         0
event_location_region           0
date_of_death                   0
gender                          0
took_part_in_the_hostilities    0
place_of_residence              0
place_of_residence_district     0
type_of_injury                  0
killed_by                       0
dtype: int64

## Check missing values

Let's check the missing values again.

In [16]:
df.isnull().sum().sum()

0

## Label Encoding Process

In [13]:
le = LabelEncoder()

# Encode every text (object) column as integer codes
for col in df.columns:
    if df[col].dtype == "object":
        # fit_transform refits the shared encoder on each column in turn
        df[col] = le.fit_transform(df[col])

df.head(10)

Unnamed: 0,name,date_of_event,age,citizenship,event_location,event_location_district,event_location_region,date_of_death,gender,took_part_in_the_hostilities,place_of_residence,place_of_residence_district,type_of_injury,killed_by
0,118,2403,32,3,304,18,2,2591,1,1,357,17,5,1
1,10300,2403,21,3,304,18,2,2591,1,1,357,17,5,1
2,323,2402,16,3,223,8,2,2590,1,1,578,7,5,1
3,2984,2401,19,3,8,9,2,2589,1,1,3,8,5,1
4,8562,2400,15,3,198,8,2,2588,1,1,232,7,5,1
5,905,2400,29,3,198,8,2,2589,1,1,232,7,5,1
6,10861,2400,24,3,155,3,0,2588,1,1,272,9,5,1
7,5677,2400,25,3,198,8,2,2588,1,1,233,7,5,1
8,5554,2400,23,3,198,8,2,2588,1,1,233,7,5,1
9,6156,2399,15,3,457,6,2,2587,1,1,544,5,5,1
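
Because a single `LabelEncoder` is refitted for every column, only the mapping of the last column remains available afterwards. If the original labels need to be recovered later, one possible approach is to keep one encoder per column. This is a minimal sketch on a made-up frame `toy`; the dictionary name `encoders` and the column list are assumptions for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "citizenship": ["Palestinian", "Israeli", "Palestinian"],
    "age": [32, 21, 16],
})

categorical_cols = ["gender", "citizenship"]  # text columns to encode
encoders = {}                                 # one fitted encoder per column
encoded = toy.copy()

for col in categorical_cols:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# The stored encoder can map the integer codes back to the original labels.
original_gender = encoders["gender"].inverse_transform(encoded["gender"])

print(encoded)
print(list(original_gender))
```

`LabelEncoder` assigns codes in sorted order of the labels, so the codes are arbitrary identifiers and do not express any ranking between categories.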


## Export to CSV

The data is exported to a CSV file named "result.csv" and saved in the "datasets" directory.

In [15]:
df.to_csv("./datasets/result.csv", index=False)
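
As a quick sanity check, the export can be read back and compared with the frame in memory. The sketch below uses a made-up frame `toy` and an in-memory buffer instead of a real file path, both of which are assumptions for illustration:

```python
import io

import pandas as pd

# Toy frame standing in for the encoded dataset; the values are invented.
toy = pd.DataFrame({"age": [32, 21], "gender": [1, 0]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind and read the CSV text back into a new frame.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.equals(toy))
```

Passing `index=False` keeps the row index out of the file; without it, the index is written as an extra unnamed column that shows up as "Unnamed: 0" when the file is read again.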

# Closing Remarks

Thank you for reading my submission. I truly worked on this assignment on my own.