# Project 1

# Step1: Read the dataset

I am using the UCDP Battle-Related Deaths dataset, which contains 1587 rows of data. The data start in 1989 and, judging by the yearly totals in Step4, run through at least 2022.

I picked the column "bd_best" for the following analysis. According to the codebook, bd_best is the UCDP best estimate of battle-related deaths in the conflict dyad in the given year, coded as an integer.

In [None]:
import pandas as pd

# 1. Load the dataset
df = pd.read_csv("BattleDeaths.csv")

# 2. Display the first few rows of the dataset
print(df.head())

   conflict_id dyad_id location_inc                  side_a side_a_id  \
0        11447   10006     Ethiopia  Government of Ethiopia        97   
1          372   10054         Mali      Government of Mali        72   
2          372   11985         Mali      Government of Mali        72   
3          372   11985         Mali      Government of Mali        72   
4          372   11985         Mali      Government of Mali        72   

                                          side_a_2nd side_b side_b_id  \
0                                                NaN   IGLF      1115   
1                                                NaN   FPLA       934   
2                                                NaN    CMA      1158   
3                                                NaN    CMA      1158   
4  Government of Armenia, Government of Austria, ...    CMA      1158   

  side_b_2nd  incompatibility  ... type_of_conflict  battle_location  gwno_a  \
0        NaN                1  ...        

# Step2: Compute
a. the mean
b. the median
c. the mode

In [None]:
# 1. Compute the mean
mean_bd_best = df["bd_best"].mean()
print(f"Mean of bd_best: {mean_bd_best}")

# 2. Compute the median
median_bd_best = df["bd_best"].median()
print(f"Median of bd_best: {median_bd_best}")

# 3. Compute the mode (mode() returns a Series, so the printout includes its index)
mode_bd_best = df["bd_best"].mode()
print(f"Mode of bd_best: {mode_bd_best}")


Mean of bd_best: 1491.188524590164
Median of bd_best: 170.0
Mode of bd_best: 0    25
Name: bd_best, dtype: int64


# Step3: Repeat using the Python standard library

To repeat the previous two steps the "hard way" (really hard) using only the Python standard library, I follow these steps:

1. Read the file using "with"
2. Clean the bd_best column using a for loop and an if statement
3. Compute the mean, median, and mode
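
The cleaning step can also be sketched in isolation. This is only a minimal sketch, not the notebook's code: `parse_count` is a hypothetical helper and the sample cells are made up. It shows a slightly more defensive variant that skips cells that are not valid integers instead of letting `int()` raise a `ValueError`.

```python
def parse_count(cell):
    """Return the cell as an int, or None if it is blank or not an integer."""
    text = cell.strip()
    if text == "":
        return None  # blank cell
    try:
        return int(text)
    except ValueError:
        return None  # e.g. "n/a" or "12.5"


# Made-up sample cells, not taken from BattleDeaths.csv
sample_cells = ["25", " 170 ", "", "n/a", "1491"]

cleaned = []
for cell in sample_cells:
    value = parse_count(cell)
    if value is not None:  # keep only cells that parsed cleanly
        cleaned.append(value)

print(cleaned)
```

Silently skipping bad cells changes the count used for the mean, so in a real analysis it may be worth also counting how many cells were dropped.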

In [None]:
import csv

# 1. Read the file
filename = "BattleDeaths.csv"

# Make an empty list to store bd_best values
bd_best_values = []

# Read the file and extract bd_best values
with open(filename, newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)  # read each row as a dictionary
    for row in reader:
        value = row["bd_best"].strip()
        if value != "":  # skip blank cells
            bd_best_values.append(int(value))

# 2. Compute the mean
total = sum(bd_best_values)
count = len(bd_best_values)
mean = total / count

# 3. Calculate the median
sorted_values = sorted(bd_best_values)  # sort the list in order to find the median

if count % 2 == 1:
    median = sorted_values[count // 2]
else:
    middle1 = sorted_values[count // 2 - 1]
    middle2 = sorted_values[count // 2]
    median = (middle1 + middle2) / 2

# 4. Calculate the mode - count how frequent each number appears
counts = {}
for num in bd_best_values:
    if num in counts:
        counts[num] += 1
    else:
        counts[num] = 1

# Find the number with the highest count (on a tie, the value seen first wins)
max_count = 0
mode = None
for num, freq in counts.items():
    if freq > max_count:
        max_count = freq
        mode = num

print("Mean:", mean)
print("Median:", median)
print("Mode:", mode)


Mean: 1491.188524590164
Median: 170.0
Mode: 25


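At first glance pandas seems to show two modes (0 and 25), but it does not. Series.mode() always returns a Series, and the 0 in the printed output is the index label of its single entry, not a value. The only mode is 25, so the pandas result and the standard-library result agree. The two approaches would differ only if several values were tied for the highest count: pandas returns every tied value, while the manual loop keeps just the tied value that appears first in the data, because it only replaces the current mode when a count is strictly greater.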
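
As a cross-check on the manual calculations, the standard library also ships a `statistics` module. The sketch below is not part of the original analysis; it uses made-up values (not taken from BattleDeaths.csv) that contain a deliberate tie, and it assumes Python 3.8 or later, where `statistics.multimode` is available and `statistics.mode` returns the first mode encountered.

```python
import statistics

# Made-up sample values with a tie: 25 and 30 both appear twice
values = [25, 30, 25, 30, 170, 1100]

mean = statistics.mean(values)
median = statistics.median(values)

# mode() returns only the first tied value encountered
first_mode = statistics.mode(values)

# multimode() returns every tied value, in order of first appearance
modes = statistics.multimode(values)

print("Mean:", mean)
print("Median:", median)
print("First mode:", first_mode)
print("All modes:", modes)
```

Here `statistics.mode` behaves like the manual loop (one value), while `statistics.multimode` behaves like pandas (all tied values).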

# Step4: Data Visualization
In this part, I draw a simple text-based bar chart of the yearly battle-death totals, using pandas for the aggregation and plain Python for the printing.
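
The chart relies on two ideas: summing deaths per year, and scaling each total so that the largest one fills a fixed bar width. The following is a minimal standard-library sketch of both ideas, assuming made-up (year, deaths) rows rather than the real dataset; the names `rows` and `bars` are illustrative only.

```python
from collections import defaultdict

# Made-up (year, deaths) rows, not taken from BattleDeaths.csv
rows = [(1989, 312), (1989, 200), (1990, 1024), (1991, 256)]

# Sum the deaths for each year
yearly_totals = defaultdict(int)
for year, deaths in rows:
    yearly_totals[year] += deaths

# Scale so that the largest yearly total gets max_bar_length characters
max_bar_length = 16
scale = max_bar_length / max(yearly_totals.values())

bars = {}
for year in sorted(yearly_totals):
    total = yearly_totals[year]
    # int() truncates, so a small non-zero total can end up with zero stars
    bars[year] = "*" * int(total * scale)
    print(f"{year}: {bars[year]} ({total})")
```

Because the bar length is truncated, the printed total next to each bar is what carries the exact figure; the stars only give a rough visual comparison.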


In [16]:
import pandas as pd

df = pd.read_csv("BattleDeaths.csv")

# compute the yearly total battle deaths
yearly_totals = df.groupby("year")["bd_best"].sum().sort_index()

# Scale the bars so that the largest yearly total is max_bar_length characters wide
max_val = yearly_totals.max()
max_bar_length = 60
scale = max_bar_length / max_val

print("Battle-Related Deaths per Year:")
print("Each '*' represents a proportional amount of deaths\n")

for year, total in yearly_totals.items():
    bar_length = int(total * scale)
    bar = "*" * bar_length
    print(f"{year}: {bar} ({total})")


Battle-Related Deaths per Year:
Each '*' represents a proportional amount of deaths

1989: *********** (55184)
1990: ***************** (80297)
1991: *************** (70353)
1992: *********** (53406)
1993: ********* (44949)
1994: ******** (38517)
1995: ******* (36742)
1996: ****** (28885)
1997: ******** (40340)
1998: ******** (40218)
1999: ***************** (81047)
2000: ***************** (78671)
2001: ***** (23592)
2002: **** (21106)
2003: ***** (23194)
2004: **** (19646)
2005: ** (12322)
2006: **** (20163)
2007: **** (19214)
2008: ****** (28741)
2009: ******* (34606)
2010: **** (21088)
2011: ***** (25080)
2012: **************** (74071)
2013: ******************** (93488)
2014: ************************* (115972)
2015: ********************** (104435)
2016: ******************* (90624)
2017: *************** (72023)
2018: *********** (55031)
2019: *********** (52610)
2020: *************** (73228)
2021: ******************************************* (199789)
2022: ******************