# PROJECT 02 : SHARK ATTACKS
-----

### BRIEF : data analysis of global shark attacks for a business idea
We first examine the Shark Attack dataset to understand its structure and formulate one or more hypotheses about the data.
We hypothesize that shark attacks are more common in certain locations and peak during specific months.
We then define a Business Case, such as:

* “As a company that sells medical products, I want to identify destinations with high shark attack rates.”
* “As a company providing supply transportation services, I want to know when and where shark attacks peak to plan the safe transport of medical supplies to hospitals.”

 Throughout the project, we will use Python and the pandas library to apply at least five data cleaning techniques to handle missing values, duplicates, and formatting inconsistencies. After cleaning, we will perform basic exploratory data analysis to validate our hypotheses and extract insights. 

#### 📝 BUSINESS IDEA: 3 Bullet Points

* Problem to Solve: Coastal hospitals and emergency response teams are not always prepared with the right medical supplies during periods of high shark-attack frequency. This business solves the problem by predicting when and where attacks are most likely, so medical supplies can be stocked in advance.

* Business Concept: Use historical shark attack data to create global heatmaps and seasonal risk forecasts. Then provide pharmaceutical products (painkillers, antibiotics, blood bags, emergency kits) and transportation support to hospitals and ambulances near high-risk beaches.

* Data Used to Profit: The business will analyze Country, Date/Month, Gender, Age, Fatality, and Type of Injury to identify high-risk locations, peak attack months, and most common injury types. This allows optimized supply production, targeted sales, and efficient delivery to the areas that need it most.



## 🌀 COLUMNS TO CLEAN :
-----

**- COUNTRY (global comparison between countries to decide where to invest more)** @Blanca

    * We check the COUNTRY column and make sure every country name is correct
    * We remove rows where COUNTRY is NULL
    * We make sure each COUNTRY name is capitalized and written the same way, for example:
        United States of America == USA == US
        Every variant has to map to one consistent name!

**- DATE ( MONTH + YEAR )** @Blanca

    * We remove rows where DATE is NULL or where the date does not include both a month and a year
    * We split the DATE column into three columns: DAY + MONTH + YEAR
    * We verify that the new YEAR column matches the old YEAR column and keep only the rows that match
        * NEW YEAR COLUMN is the one split from the DATE column
        * OLD YEAR COLUMN is the one already existing in the original sheet
    * Rows where the new YEAR and the old YEAR do not match are removed, so we keep only data with accurate years
    * We remove the DAY and OLD YEAR columns and keep the MONTH and the matching YEAR
    * We identify the months in which shark attacks happen the most

**- GENDER ( F or M )** @Cecilia

    * We check the unique values, keep only the two values F or M, and delete all rows that have other values
    * We noticed that mostly males get attacked
    * We compute the percentages of male vs. female victims

**- AGE (victims' age ranges)** @Samia

    * the majority of victims survive
    * we split the ages into three categories (minors under 18 / adults 18-40 / elders 40+)
    * we compute the percentages of victims per age range
    * note that treatment complications depend on the victim's age

**- FATALITY ( Y or N )** @Cecilia

    * depends on the 'Type' of injury column and whether it includes death or high severity
    * We check the unique values, make sure there are only two values Y or N, and delete all rows that have other values
    * most victims survived
    * relevant for pharmaceutical logistics and the transportation of injured people to the hospital
    * assumption: survivors are the large majority, and we use that to sell the idea of profiting from selling products to hospitals

**- INJURY TYPE** @Samia

    * clean the type of injury by severity
    * separate the injury type into different severity levels
    * separate the injury type into different body parts
    * treatment depends on the type of injury, and so do the supplies

-------



### IMPORT UTILS and INIT from SRC

In [None]:
import sys, os
sys.path.append(os.path.abspath(".."))
from src import utils
from src import init

-----
### GENERAL

In [None]:
import pandas as pd

# Load the CSV
df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")

# Option 1: Display the entire DataFrame (all rows and columns)
pd.set_option('display.max_rows', None)      # Show all rows
pd.set_option('display.max_columns', None)   # Show all columns

#the full table
print(df)  

In [None]:
df.shape

df.columns

-----
### 'Country' CLEANING

In [None]:
df.columns

In [None]:
df['Country']

In [None]:
df['Country'].nunique()

In [None]:
df['Country'].unique()

In [None]:
# COUNTRY

df = df.dropna(subset=["Country"])
df["Country"] = df["Country"].str.strip().str.title()

country_replacements = {
    "United States Of America": "USA",
    "Usa": "USA",
    "Us": "USA",
    "U.s.": "USA",
    "United Kingdom": "UK",
    "England": "UK",
    "Uk": "UK",
    "Brasil": "Brazil",
    "México": "Mexico",
}

df["Country"] = df["Country"].replace(country_replacements)
df = df[df["Country"] != "Italy / Croatia"]

df["Country"].describe()
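
Once the names are consistent, the business question of which destinations see the most attacks reduces to a frequency count. The following is a minimal sketch on toy data; the helper name `top_countries`, the output column names, and the sample rows are illustrative assumptions and are not part of the dataset.

```python
import pandas as pd

def top_countries(frame: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the n most frequent countries with their counts and percentage shares."""
    counts = frame["Country"].value_counts().head(n)
    # Share is computed against all rows in the frame, not only the top n
    share = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({"Attacks": counts, "Share_%": share})

# Toy rows standing in for the cleaned dataset (illustrative only)
sample = pd.DataFrame(
    {"Country": ["USA", "USA", "Australia", "Brazil", "USA", "Australia"]}
)

ranking = top_countries(sample, n=2)
print(ranking)
```

On the real data the same call would be `top_countries(df)`, run after the replacements above so that variants such as "Usa" and "Us" are not counted separately.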

----------
### 'Date' CLEANING AND SEPARATE INTO DD MM YYYY COLUMNS

In [None]:
df.columns

In [None]:
import pandas as pd

# Example data to practice Date cleaning and separating.
# Kept in its own variable so the real shark-attack df is not overwritten.
data = {
    'date': [
        '12 05 2023',
        '23 07-2021',
        '05-08 2022',
        '17-11-2020'
    ]
}

example_df = pd.DataFrame(data)

# Regex pattern to match all variations (parts separated by '-' or a space)
pattern = r'(?P<DD>\d{2})[- ](?P<MM>\d{2})[- ](?P<YYYY>\d{4})'

# Extract DD, MM, YYYY into new columns
example_df[['DD', 'MM', 'YYYY']] = example_df['date'].str.extract(pattern)

print(example_df)

In [None]:
import pandas as pd
import re

# DataFrame with the 'Date' column
df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv") # or pd.DataFrame({'Date': [...your list...]})

# Function to parse dates-------------------------------------------------------------------------------------------------------------------
def parse_date(date_str):
    date_str = str(date_str).strip()
    
    # Remove 'Reported ' prefix
    date_str = re.sub(r'^Reported\s+', '', date_str, flags=re.IGNORECASE)
    
    # Patterns to match
    patterns = [
        r'(?P<DD>\d{1,2})[- ](?P<MM>[A-Za-z]+)[- ](?P<YYYY>\d{4})',  # DD MMM YYYY
        r'(?P<MM>[A-Za-z]+)[- ](?P<YYYY>\d{4})',                      # MMM-YYYY
        r'(?P<YYYY>\d{4})',                                           # YYYY only
    ]
    
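    # Template so every row returns the same three keys, even for partial matches
    empty = {'DD': None, 'MM': None, 'YYYY': None}

    for pat in patterns:
        match = re.match(pat, date_str)
        if match:
            # Groups missing from this pattern (e.g. DD) stay None
            return {**empty, **match.groupdict()}

    return dict(empty)
#--------------------------------------------------------------------------------------------------------------------------------------------
# Apply the function to the column and expand the dicts into DD / MM / YYYY columns
parsed_dates = df['Date'].apply(parse_date).apply(pd.Series)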

# Merge back to original DataFrame
df = pd.concat([df, parsed_dates], axis=1)

df.head(10)
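
The plan for DATE also calls for checking the year parsed from the date against the year column that already exists, dropping rows without a month, and then counting attacks per month. The following is one possible sketch of those steps on toy data. It assumes the original year column is named `Year` and that the parsed columns are `DD`, `MM`, and `YYYY` as above; the helper name `keep_matching_years` and the sample rows are hypothetical.

```python
import pandas as pd

def keep_matching_years(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep rows that have a month and whose parsed year equals the original Year."""
    parsed_year = pd.to_numeric(frame["YYYY"], errors="coerce")
    original_year = pd.to_numeric(frame["Year"], errors="coerce")

    # NaN == NaN is False, so rows with a missing year on either side are dropped too
    matches = parsed_year == original_year

    cleaned = frame.loc[matches].dropna(subset=["MM"]).copy()
    cleaned["YYYY"] = parsed_year.loc[cleaned.index].astype(int)

    # The day and the now-redundant original year are no longer needed
    return cleaned.drop(columns=["DD", "Year"])

# Toy rows standing in for the parsed dataset (illustrative only)
sample = pd.DataFrame({
    "DD":   ["12", None, "03", "07", "21"],
    "MM":   ["Jun", "Jul", "Jun", None, "Jun"],
    "YYYY": ["2021", "2019", "2020", "2018", "2015"],
    "Year": [2021.0, 2019.0, 2018.0, 2018.0, 2015.0],
})

cleaned = keep_matching_years(sample)
attacks_per_month = cleaned["MM"].value_counts()

print(cleaned)
print(attacks_per_month)
```

In this sketch the third row is dropped because the years disagree and the fourth because it has no month. `attacks_per_month.idxmax()` then gives the month with the most recorded attacks.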


-----
### 'Sex' CLEANING

In [None]:
df.columns

In [None]:
df['Sex']

In [None]:
df['Sex'].info()

In [None]:
df['Sex'].unique()

In [None]:
# NaN , F, M
df['Sex'].value_counts(dropna=False)

In [None]:

# Normalize whitespace and case first, then map the long forms to M / F
df['Sex'] = df['Sex'].str.strip().str.upper()
df['Sex'] = df['Sex'].replace({'MALE': 'M', 'FEMALE': 'F'})

In [None]:

df = df.dropna(subset=['Sex'])
df['Sex'].value_counts(dropna=False)
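
The plan for GENDER mentions keeping only F and M and reporting the male-to-female percentages. A minimal sketch of that step on toy data follows; the helper name `sex_percentages` and the sample values (including the stray entries) are illustrative assumptions.

```python
import pandas as pd

def sex_percentages(frame: pd.DataFrame) -> pd.Series:
    """Percentage of M and F victims, ignoring any other value."""
    valid = frame[frame["Sex"].isin(["M", "F"])]
    return (valid["Sex"].value_counts(normalize=True) * 100).round(1)

# Toy rows standing in for the cleaned dataset (illustrative only)
sample = pd.DataFrame({"Sex": ["M", "M", "F", "M", "N", "M", "LLI"]})

shares = sex_percentages(sample)
print(shares)
```

Filtering with `isin` before counting keeps stray codes out of the denominator, so the two percentages always sum to 100.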

In [None]:
df.columns

In [None]:
df.columns.to_list()

In [None]:
df.shape

In [None]:
df['Age'].describe().round(2)

In [None]:
df.info()

In [None]:
df['Age'].info()
#238 non-null object

In [None]:
df['Age'].unique()

In [None]:
df['Age'].nunique()

In [None]:
df['Age'].describe()

-----
### 'Age' CLEANING

In [None]:
df['Age'].info()

In [None]:
df['Age'].describe()

In [None]:
df['Age'].unique()

In [None]:

df['Age'].nunique()


In [None]:
import pandas as pd
import numpy as np
import re

# Load CSV
df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")

def clean_age(value):
    if pd.isna(value):
        return np.nan
    
    value = str(value).strip().lower()

    # Remove common unwanted characters
    value = value.replace("years", "").replace("ca.", "").replace("about", "").strip()

    # Words that indicate no age
    if value in ["n/a", "na", "none", "?", "", "unknown"]:
        return np.nan

    # Teen keyword (assume average age 15)
    if "teen" in value:
        return 15

    # mid-20s -> 25
    if "mid" in value and "20" in value:
        return 25

    # Capture numbers if present
    nums = re.findall(r"\d+", value)

    if len(nums) == 1:
        return int(nums[0])
    
    # If multiple numbers like "20/30", "24 & 35", "16 to 18", take the average
    if len(nums) >= 2:
        nums = [int(n) for n in nums]
        return sum(nums) / len(nums)

    return np.nan

# Apply cleaning function
df['Age_clean'] = df['Age'].apply(clean_age)

# Convert to numeric properly
df['Age_clean'] = pd.to_numeric(df['Age_clean'], errors='coerce')

# Fill missing values with median
median_age = df['Age_clean'].median()
df['Age_clean'] = df['Age_clean'].fillna(median_age)

# Confirm results
print(df['Age_clean'].head(20))
print("Median age used:", median_age)

# (Optional) Save cleaned file
df.to_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\cleaned.csv", index=False)


In [18]:
df['Age'].describe().round().astype(int)

count     238
unique     72
top        17
freq       14
Name: Age, dtype: int64

In [None]:
import pandas as pd
import numpy as np
import re

# Load the raw CSV
df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")

# Cleaning function: turn one raw Age entry into a number (or NaN if no age is given)
def clean_age(value):
    if pd.isna(value):
        return np.nan
    
    value = str(value).strip().lower()

    # Remove useless text
    value = value.replace("years", "").replace("year", "").replace("ca.", "").replace("about", "").strip()

    # Words meaning no age
    if value in ["n/a", "na", "none", "?", "", "unknown"]:
        return np.nan

    # Teen keyword (approx.)
    if "teen" in value:
        return 15

    # mid-20s -> 25
    if "mid" in value and "20" in value:
        return 25

    # Extract numbers
    nums = re.findall(r"\d+", value)

    # Single number
    if len(nums) == 1:
        return int(nums[0])
    
    # Multiple numbers -> average them (20/30, 23 & 20, 16 to 18)
    if len(nums) >= 2:
        nums = [int(n) for n in nums]
        return sum(nums) / len(nums)

    return np.nan

# Apply cleaner
df['Age_clean'] = df['Age'].apply(clean_age)

# Convert to numeric
df['Age_clean'] = pd.to_numeric(df['Age_clean'], errors='coerce')

# Fill missing with median
median_age = df['Age_clean'].median()
df['Age_clean'] = df['Age_clean'].fillna(median_age)

# Convert to integer (no floats)
df['Age_clean'] = df['Age_clean'].round().astype(int)

# Show result
print(df[['Age', 'Age_clean']].head(20))
print("Median age used:", median_age)

# Save cleaned dataset
df.to_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\cleaned.csv", index=False)

# Check result
print("\n",df["Age_clean"].value_counts())

In [None]:
percentiles = df['Age_clean'].describe(percentiles=[0.25, 0.5, 0.75])

print("Age Percentile Statistics:")
print(f"Min:   {int(percentiles['min'])}")
print(f"25%:   {int(percentiles['25%'])}")
print(f"50% (Median): {int(percentiles['50%'])}")
print(f"75%:   {int(percentiles['75%'])}")
print(f"Max:   {int(percentiles['max'])}")

# the majority were people up to 26 years old

In [None]:
# Replace the original Age column with the cleaned integer values
df['Age'] = df['Age_clean']

# Optionally, drop the temporary Age_clean column
df.drop(columns=['Age_clean'], inplace=True)

# Check the result
print(df['Age'].head())
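
The plan for AGE splits victims into minors, adults, and elders and reports the share of each group. One way to sketch this is with `pd.cut`. The boundary handling is an assumption here: age 40 is placed in the 40+ group, so the adult bin covers 18 to 39. The helper name `add_age_group`, the labels, and the sample ages are illustrative.

```python
import numpy as np
import pandas as pd

def add_age_group(frame: pd.DataFrame, age_col: str = "Age") -> pd.DataFrame:
    """Return a copy of frame with an Age_group column (minor / adult / elder)."""
    bins = [0, 18, 40, np.inf]
    labels = ["Minor (<18)", "Adult (18-39)", "Elder (40+)"]
    out = frame.copy()
    # right=False makes the bins [0, 18), [18, 40), [40, inf)
    out["Age_group"] = pd.cut(out[age_col], bins=bins, labels=labels, right=False)
    return out

# Toy ages standing in for the cleaned column (illustrative only)
sample = pd.DataFrame({"Age": [12, 17, 18, 25, 39, 40, 63, 30]})

grouped = add_age_group(sample)
group_share = (grouped["Age_group"].value_counts(normalize=True) * 100).round(1)

print(grouped)
print(group_share)
```

Because missing ages were filled with the median earlier, the group that contains the median will look larger than it is in the recorded data. Computing the shares before imputation is an alternative worth comparing.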

-----
### 'Type' & 'Fatal Y/N'
**Fatality depends on Type of injury and severity**

In [None]:
import pandas as pd
import numpy as np

def classify_injury(injury, fatal):
    # Treat missing injuries as empty text; str() guards against non-string cells
    injury = "" if pd.isna(injury) else str(injury).lower()
    fatal = str(fatal).strip().upper()

    # HOAX / FALSE
    if "hoax" in injury or "false" in injury:
        return "Hoax / False report"

    # NOT A SHARK
    if any(x in injury for x in ["not a shark", "stingray", "barracuda", "propeller", "fish"]):
        return "Not a shark"

    # NO INJURY
    if "no injury" in injury or "no injuries" in injury:
        return "No injury"

    # PROVOKED (guard against "unprovoked", which also contains "provoked")
    if "provoked" in injury and "unprovoked" not in injury:
        return "Provoked incident"

    # MISSING
    if any(x in injury for x in ["missing", "disappeared", "body not recovered"]):
        return "Missing / Unknown"

    # FATAL CASES
    if fatal == "Y":
        # drowned then bitten after death
        if "post-mortem" in injury or ("drown" in injury and "post" in injury):
            return "Fatal | Drowned, shark scavenged"
        if "unconfirmed" in injury or "probable" in injury:
            return "Fatal | Unconfirmed shark involvement"
        return "Fatal | Shark confirmed"

    # NON-FATAL CASES
    if fatal == "N":
        if any(x in injury for x in ["bitten", "lacerat", "puncture", "wound", "abrasion"]):
            return "Non-fatal | Confirmed shark bite"
        # Text but no bite
        return "Non-fatal | Other"

    # Unknown
    return "Unknown"

# Apply to dataframe
df["Type"] = df.apply(lambda row: classify_injury(row["Injury"], row["Fatal Y/N"]), axis=1)


# Check RESULTS--------------------------------------------------------------------------------------------------------------

# Count number of records in each injury type
counts = df["Type"].value_counts()

# Calculate percentages
percentages = (counts / len(df)) * 100

# Create summary DataFrame
summary = pd.DataFrame({
    "Cases": counts,
    "Percentage": percentages.round(2)
})

# Save summary to CSV
summary.to_csv("injury_type_summary.csv", index=True)

# Print SUMMARY
# print(summary)

print("Injury Type Summary:\n")
for category in counts.index:
    print(f"{category}: {counts[category]} cases  ({percentages[category]:.2f}%)")



In [None]:
df.head(5)

#### CLASSIFY INJURY SEVERITY

In [None]:
df['Type'].describe()

In [None]:
import re

# Function to classify injury severity
def classify_severity(val):
    # Treat missing injuries as empty text; str() guards against non-string cells
    val = "" if pd.isna(val) else str(val).lower()

    if re.search(r'fatal|death|died|deceased', val):
        return "Death"
    if re.search(r'amputation|severed|severely bitten', val):
        return "Severe injury"
    # Checked before the bite keywords so "no injury, board bitten" is not counted as an injury
    if re.search(r'no injury|not injured|unharmed', val):
        return "No injury"
    if re.search(r'bitten|bite|laceration|cuts|fracture', val):
        return "Minor injury"
    if re.search(r'possible drowning|unknown', val):
        return "Uncertain"

    return "Other/Unknown"

# Apply to create Injury_Severity column
df["Injury_Severity"] = df["Injury"].apply(classify_severity)

# Now derive Fatal Y/N based on Injury_Severity
def derive_fatal(severity):
    # Only the "Death" category counts as fatal; every other category maps to "N"
    return "Y" if severity == "Death" else "N"

df["Fatal Y/N"] = df["Injury_Severity"].apply(derive_fatal)

# Check the first rows
print(df[["Injury", "Injury_Severity", "Fatal Y/N"]].head(20))
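
The plan for INJURY TYPE also mentions separating injuries by body part, since the supplies needed depend on where the victim was hurt. A rough keyword-based sketch follows. The keyword lists, the group labels, the `Body_Part` column name, and the helper `extract_body_parts` are all assumptions made for illustration, and simple keyword matching will misread some free-text entries.

```python
import re
import pandas as pd

# Hypothetical keyword groups; extend or adjust after inspecting the real Injury text
BODY_PART_PATTERNS = {
    "Leg / foot": r"\b(?:leg|legs|thigh|calf|knee|ankle|foot|feet|toe|toes)\b",
    "Arm / hand": r"\b(?:arm|arms|forearm|elbow|wrist|hand|hands|finger|fingers)\b",
    "Torso": r"\b(?:torso|chest|abdomen|hip|buttock|buttocks|shoulder)\b",
    "Head / neck": r"\b(?:head|face|neck)\b",
}

def extract_body_parts(injury) -> str:
    """Return a comma-separated list of body-part groups mentioned in the text."""
    if pd.isna(injury):
        return "Unspecified"
    text = str(injury).lower()
    found = [
        label
        for label, pattern in BODY_PART_PATTERNS.items()
        if re.search(pattern, text)
    ]
    return ", ".join(found) if found else "Unspecified"

# Toy rows standing in for the Injury column (illustrative only)
sample = pd.DataFrame({
    "Injury": [
        "Lacerations to left foot",
        "Right arm and hand bitten",
        "No injury, board bitten",
        None,
        "Bitten on thigh and forearm",
    ]
})
sample["Body_Part"] = sample["Injury"].apply(extract_body_parts)
print(sample)
```

The word boundaries (`\b`) keep "hand" from matching inside longer words. A row can mention several body parts, so the result is a joined string rather than a single category.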


In [None]:
df['Fatal Y/N'].describe()