# Project 1: HIV Diagnoses in NYC

Dataset: [NYC Open Data - HIV/AIDS Diagnoses by Neighborhood, Sex, and Race/Ethnicity](https://data.cityofnewyork.us/Health/HIV-AIDS-Diagnoses-by-Neighborhood-Sex-and-Race-Et/ykvb-493p)

This project explores HIV diagnosis data in New York City. It computes the mean, median, and mode of the diagnosis counts, first with pandas and then with only the Python standard library, and draws a text-based bar chart of total diagnoses by borough.

## Using pandas

In [None]:
import pandas as pd
df = pd.read_csv("HIV.csv")
print("Columns in the dataset:\n", df.columns.tolist())

Columns in the dataset:
 ['YEAR', 'Borough', 'Neighborhood (U.H.F)', 'SEX', 'RACE/ETHNICITY', 'TOTAL NUMBER OF HIV DIAGNOSES', 'HIV DIAGNOSES PER 100,000 POPULATION', 'TOTAL NUMBER OF CONCURRENT HIV/AIDS DIAGNOSES', 'PROPORTION OF CONCURRENT HIV/AIDS DIAGNOSES AMONG ALL HIV DIAGNOSES', 'TOTAL NUMBER OF AIDS DIAGNOSES', 'AIDS DIAGNOSES PER 100,000 POPULATION']


In [8]:
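col = "TOTAL NUMBER OF HIV DIAGNOSES"

# Remove thousands separators, then coerce non-numeric entries to NaN
df[col] = pd.to_numeric(df[col].replace(",", "", regex=True), errors="coerce")
values = df[col].dropna()  # numeric values only; the statistics below skip NaN anyway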

print("HIV Diagnoses in NYC — Basic Statistics")
print("---------------------------------------")
print("Mean:", df[col].mean())
print("Median:", df[col].median())
print("Mode:", df[col].mode()[0])

HIV Diagnoses in NYC — Basic Statistics
---------------------------------------
Mean: 21.010267857142857
Median: 2.0
Mode: 0.0


## Without pandas (The Hard Way)

In [10]:
import csv

filename = "HIV.csv"
col = "TOTAL NUMBER OF HIV DIAGNOSES"

values = []

# Read the CSV file and collect numeric values
with open(filename, newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        val = row[col].replace(",", "").strip()
        if val not in ("", "*", "NA"):      # Skip empty, masked, or NA values
            try:
                values.append(float(val))
            except ValueError:
                continue

# Compute the mean
mean_val = sum(values) / len(values)

# Compute the median
values.sort()
n = len(values)
if n % 2 == 1:
    median_val = values[n // 2]
else:
    median_val = (values[n // 2 - 1] + values[n // 2]) / 2

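# Compute the mode (most frequent value). Because values is sorted, ties resolve
# to the smallest value, matching pandas' mode()[0]. values.count makes this O(n^2).
mode_val = max(values, key=values.count)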

# Print results
print("HIV Diagnoses in NYC — Basic Statistics (No pandas)")
print("--------------------------------------------------")
print(f"Mean   : {mean_val:.2f}")
print(f"Median : {median_val:.2f}")
print(f"Mode   : {mode_val:.2f}")


HIV Diagnoses in NYC — Basic Statistics (No pandas)
--------------------------------------------------
Mean   : 21.01
Median : 2.00
Mode   : 0.00
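
As a cross-check on the hand-written calculations above, the standard library's `statistics` module provides `mean`, `median`, and `mode` directly. The sketch below is a minimal illustration on a small made-up list: `sample` and `basic_stats` are hypothetical names, and the numbers are not taken from HIV.csv. In the notebook, the `values` list built from the CSV could be passed instead.

```python
import statistics

# Hypothetical sample of counts, for illustration only (not from HIV.csv)
sample = [0.0, 0.0, 1.0, 2.0, 5.0, 40.0]


def basic_stats(values):
    """Return (mean, median, mode) computed with the standard library."""
    return (
        statistics.mean(values),
        statistics.median(values),
        # On ties, Python 3.8+ returns the first mode encountered
        statistics.mode(values),
    )


mean_val, median_val, mode_val = basic_stats(sample)
print(f"Mean   : {mean_val:.2f}")
print(f"Median : {median_val:.2f}")
print(f"Mode   : {mode_val:.2f}")
```

Even in this toy list a single large value pulls the mean well above the median, which is the same pattern seen in the real results (mean 21.01, median 2.00).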


## Data Visualization

In [14]:
# Group by Borough and sum diagnoses
borough_totals = df.groupby("Borough")[col].sum().dropna()

# Sort from highest to lowest
borough_totals = borough_totals.sort_values(ascending=False)

# Set scale so the longest bar fits nicely
max_val = borough_totals.max()
max_width = 40  # max number of characters for the longest bar
scale = max_width / max_val

print("\nHIV Diagnoses in NYC by Borough (text-based chart)")
print("--------------------------------------------------")
for borough, total in borough_totals.items():
    borough = borough.replace("\n", " ").strip()   # remove line breaks
    bar_length = int(total * scale)
    bar = "*" * bar_length
    print(f"{borough:20} | {bar:<{max_width}} {int(total)}")


HIV Diagnoses in NYC by Borough (text-based chart)
--------------------------------------------------
All                  | **************************************** 49548
Brooklyn             | *********************                    27212
Bronx                | ******************                       22444
Manhattan            | ***************                          18944
Queens               | **************                           18376
Staten Island        | *                                        2138
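
The chart above includes an `All` row, which appears to be a citywide aggregate in the `Borough` column rather than a borough. Because it sets the scale, every real borough's bar is compressed. The sketch below shows one possible way to leave aggregate labels out before scaling. It assumes `All` is the only aggregate label, uses a hypothetical helper named `text_bar_chart`, and works on a plain dictionary whose totals are copied from the output above, so it runs without pandas.

```python
def text_bar_chart(totals, max_width=40, exclude=("All",)):
    """Return chart lines for a {label: total} mapping, skipping aggregate labels."""
    # Drop aggregate rows so the longest bar belongs to a real borough
    kept = {label: total for label, total in totals.items() if label not in exclude}
    if not kept:
        return []
    max_val = max(kept.values())
    lines = []
    # Sort from highest to lowest total
    for label, total in sorted(kept.items(), key=lambda kv: kv[1], reverse=True):
        # Scale each bar relative to the largest remaining total
        bar_length = int(total * max_width / max_val) if max_val else 0
        lines.append(f"{label:20} | {'*' * bar_length:<{max_width}} {int(total)}")
    return lines


# Totals copied from the chart output above
borough_totals = {
    "All": 49548, "Brooklyn": 27212, "Bronx": 22444,
    "Manhattan": 18944, "Queens": 18376, "Staten Island": 2138,
}
chart_lines = text_bar_chart(borough_totals)
print("\n".join(chart_lines))
```

With pandas, a comparable filter would be `df[df["Borough"] != "All"]` applied before the `groupby`. Other columns in this dataset may also carry aggregate values, which this sketch does not attempt to handle.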
