# PROJECT 02 :
-----

### BRIEF: data analysis of global shark attacks for a business idea
We first examine the Shark Attack dataset to understand its structure and formulate one or more hypotheses about the data.
We hypothesize that shark attacks are more common in certain locations and peak during specific months.
We then define a business case, such as:

* “As a company that sells medical products, I want to identify destinations with high shark attack rates.”
* “As a company providing supply transportation services, I want to know when and where shark attacks peak to plan the safe transport of medical supplies to hospitals.”

 Throughout the project, we will use Python and the pandas library to apply at least five data cleaning techniques to handle missing values, duplicates, and formatting inconsistencies. After cleaning, we will perform basic exploratory data analysis to validate our hypotheses and extract insights. 

📝 BUSINESS IDEA - 3 Bullet Points

* Problem to Solve: Coastal hospitals and emergency response teams are not always prepared with the right medical supplies during periods of high shark-attack frequency. This business solves the problem by predicting when and where attacks are most likely, so medical supplies can be stocked in advance.

* Business Concept: Use historical shark attack data to create global heatmaps and seasonal risk forecasts. Then provide pharmaceutical products (painkillers, antibiotics, blood bags, emergency kits) and transportation support to hospitals and ambulances near high-risk beaches.

* Data Used to Profit: The business will analyze Country, Date/Month, Gender, Age, Fatality, and Type of Injury to identify high-risk locations, peak attack months, and most common injury types. This allows optimized supply production, targeted sales, and efficient delivery to the areas that need it most.



## 🧱 COLUMNS TO USE FROM ORIGINAL DATASET:
------

COUNTRY (global comparison between countries, to decide where to invest more)

    * We check the COUNTRY column and make sure every country name is correct
    * We remove rows where COUNTRY is NULL
    * We make sure each COUNTRY name is capitalized and written the same way, for example:
        United States of America == USA == US
        must all become one consistent name
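The COUNTRY steps above can be sketched as follows. This is only a minimal sketch: the sample rows and the `COUNTRY_ALIASES` table are assumptions made for illustration, and the real alias list has to be built after inspecting `df['Country'].unique()`.

```python
import pandas as pd

# Hypothetical sample; the real values come from df["Country"].
sample = pd.DataFrame(
    {"Country": ["USA", " usa", "United States of America", "US", "australia", None]}
)

# Assumed alias table: extend it after looking at the real unique values.
COUNTRY_ALIASES = {
    "US": "USA",
    "UNITED STATES": "USA",
    "UNITED STATES OF AMERICA": "USA",
}


def clean_country(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a country and normalise the spelling of the rest."""
    out = frame.dropna(subset=["Country"]).copy()
    # Trim whitespace and upper-case, matching the dataset's own style (e.g. AUSTRALIA).
    out["Country"] = out["Country"].str.strip().str.upper()
    # Map known aliases onto one name; names not in the table pass through unchanged.
    out["Country"] = out["Country"].replace(COUNTRY_ALIASES)
    return out


cleaned = clean_country(sample)
print(cleaned["Country"].value_counts())
```

Upper-casing is chosen here because the raw data already writes countries in capitals; any other single convention would work as long as it is applied everywhere.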

DATE ( MONTH + YEAR )

    * We remove rows where DATE is NULL or does not include both a month and a year
    * We split the DATE column into three columns: DAY + MONTH + YEAR
    * We verify that the new YEAR column matches the old YEAR column
        * NEW YEAR COLUMN is the one split from the DATE column
        * OLD YEAR COLUMN is the one already existing in the original sheet
    * Where the two YEAR columns do not match, we remove the row, so we keep only rows with accurate years
    * We remove the DAY and OLD YEAR columns and keep the MONTH and the matching YEAR
    * We identify the months in which shark attacks happen the most

GENDER ( F or M )

    * We check the unique values and keep only the two values F or M, deleting all rows that have other values
    * We noticed that mostly males (M) get attacked
    * We compute the percentages of male versus female victims
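For the GENDER column, a minimal sketch of the filter and the male/female percentages could look like this. The sample values are hypothetical, and trimming and upper-casing before filtering is our assumption (it rescues entries such as `" M"` or `"m"` instead of deleting them).

```python
import pandas as pd

# Hypothetical sample; the real values come from df["Sex"].
sample = pd.DataFrame({"Sex": ["M", "F", " M", "m", None, "lli", "M"]})


def clean_sex(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose Sex is M or F after trimming and upper-casing."""
    out = frame.copy()
    out["Sex"] = out["Sex"].str.strip().str.upper()
    # isin() is False for missing values, so NaN rows are dropped as well.
    return out[out["Sex"].isin(["M", "F"])]


cleaned_sex = clean_sex(sample)

# Share of each gender among the remaining victims, in percent.
sex_share = cleaned_sex["Sex"].value_counts(normalize=True).mul(100).round(1)
print(sex_share)
```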

AGE (victims' age ranges)

    * We split the ages into three categories (minors under 18 / adults 18-40 / elders 40+)
    * We compute the percentage of victims in each age range
    * The majority of victims survive, so keep in mind that complications during treatment depend on age
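The age grouping can be sketched with `pd.to_numeric` and `pd.cut`. The sample is hypothetical, and two choices here are assumptions: non-numeric ages such as `20/30` are dropped, and a victim aged exactly 40 is counted as an elder (the plan leaves that boundary open).

```python
import pandas as pd

# Hypothetical sample; the real Age column is text and contains entries like "20/30".
sample = pd.DataFrame({"Age": ["44", "20/30", "73", "6", "32", None, "18", "40"]})


def add_age_group(frame: pd.DataFrame) -> pd.DataFrame:
    """Convert Age to a number and bucket it into minor / adult / elder."""
    out = frame.copy()
    # Anything that is not a plain number becomes NaN and is dropped.
    out["AgeNum"] = pd.to_numeric(out["Age"], errors="coerce")
    out = out.dropna(subset=["AgeNum"]).copy()
    # right=False gives the intervals [0, 18), [18, 40) and [40, inf).
    out["AgeGroup"] = pd.cut(
        out["AgeNum"],
        bins=[0, 18, 40, float("inf")],
        right=False,
        labels=["minor", "adult", "elder"],
    )
    return out


aged = add_age_group(sample)

# Percentage of victims in each age range.
age_share = aged["AgeGroup"].value_counts(normalize=True).mul(100).round(1)
print(age_share)
```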

FATALITY ( Y or N )

    * We check the unique values and keep only the two values Y or N, deleting all rows that have other values
    * Assumption: a high percentage of victims survive, and we use this to sell the idea of profiting from selling products to hospitals
    * Most victims survive, which matters for pharmaceutical logistics and for transporting injured people to the hospital
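A minimal sketch of the Y/N filter and the survival percentage, using hypothetical sample values. Note that in the raw file loaded in this notebook, `df.info()` reports `Fatal Y/N` with 0 non-null values, so on that file the filter would return no rows until the column is populated.

```python
import pandas as pd

# Hypothetical sample; in the raw file shown in this notebook the column is empty.
sample = pd.DataFrame({"Fatal Y/N": ["N", "Y", "N", "UNKNOWN", None, "N", "N"]})


def keep_yes_no(frame: pd.DataFrame, column: str = "Fatal Y/N") -> pd.DataFrame:
    """Keep only the rows where the fatality flag is exactly Y or N."""
    # isin() is False for NaN and for every other label, so those rows are removed.
    return frame[frame[column].isin(["Y", "N"])].copy()


fatality = keep_yes_no(sample)

# Percentage of survivors (N) and fatalities (Y) among the remaining rows.
fatal_share = fatality["Fatal Y/N"].value_counts(normalize=True).mul(100).round(1)
print(fatal_share)
```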

INJURY TYPE

    * We clean the injury type and separate it by severity
    * We separate the injury type by body part
    * Treatment depends on the type of injury, and so do the supplies
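One possible way to separate injuries by severity and body part is simple keyword matching on the free-text `Injury` column. The keyword lists below are assumptions made up for illustration, not a validated medical classification, and the sample rows are taken from injury descriptions visible in the raw data.

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Injury": [
            "Severe injuries to lower limbs",
            "Calf severely bitten",
            "Left foot bitten",
            "Minor cuts to left thigh",
            "Right forearm and left hand injured",
            "FATAL",
        ]
    }
)

# Assumed keyword lists, checked in order; the first matching label wins.
SEVERITY_KEYWORDS = {
    "fatal": ["fatal", "death"],
    "severe": ["sever", "removed", "amputat"],
    "minor": ["minor", "abrasion", "no injury"],
}
BODY_PART_KEYWORDS = {
    "leg": ["leg", "calf", "thigh", "foot", "lower limb"],
    "arm": ["arm", "hand"],
}


def classify(text, keyword_map, default="unspecified"):
    """Return the first label whose keywords appear in the text (case-insensitive)."""
    if not isinstance(text, str):
        return default
    lowered = text.lower()
    for label, keywords in keyword_map.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return default


sample["Severity"] = sample["Injury"].apply(classify, keyword_map=SEVERITY_KEYWORDS)
sample["BodyPart"] = sample["Injury"].apply(classify, keyword_map=BODY_PART_KEYWORDS)
print(sample)
```

Plain substring matching is crude (for example, `arm` would also match `harm`), so the resulting labels should be spot-checked against the original text before they drive any supply decision.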



## RAW DATA
-------

In [84]:
import pandas as pd

df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")
print(df)


                                            Date    Year          Type  \
0                                    25 Aug 2023  2023.0    Unprovoked   
1                                    21 Aug-2023  2023.0  Questionable   
2                                    07-Jun-2023  2023.0    Unprovoked   
3                                    02-Mar-2023  2023.0    Unprovoked   
4                                    18-Feb-2023  2023.0  Questionable   
5                                    08-Feb-2022  2022.0           NaN   
6                                    15-Nov-2021  2021.0    Unprovoked   
7                                    16-Oct-2021  2021.0           NaN   
8                                    10-Sep-2021  2021.0           NaN   
9                                    29-Jul-2020  2020.0    Watercraft   
10                                   06-Jun-2020  2020.0    Unprovoked   
11                                   20-Dec-2019  2019.0      Provoked   
12                                   2

In [85]:
import pandas as pd

df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")
df.head(5)

Unnamed: 0,Date,Year,Type,Country,State,Location,Activity,Name,Sex,Age,Injury,Fatal Y/N,Time,Species,Source
0,25 Aug 2023,2023.0,Unprovoked,AUSTRALIA,New South Wales,"Lighthouse Beach, Port Macquarie",Surfing,Toby Begg,M,44,Severe injuries to lower limbs,,10h00,"White shark, 3.8-4.2m","B. Myatt, & M. Michaelson, GSAF"
1,21 Aug-2023,2023.0,Questionable,BAHAMAS,New Providence Isoad,"Saunders Beach, Nassau",,male,M,20/30,Body found with shark bites. Possible drowning...,,Morning,,"The Tribune, 8/21/2023"
2,07-Jun-2023,2023.0,Unprovoked,BAHAMAS,Freeport,Shark Junction,Scuba diving,Heidi Ernst,F,73,Calf severely bitten,,13h00,Caribbean rreef shark,"J. Marchand, GSAF"
3,02-Mar-2023,2023.0,Unprovoked,SEYCHELLES,Praslin Island,,Snorkeling,Arthur …,M,6,Left foot bitten,,Afternoon,Lemon shark,"Midlibre, 3/18/2023"
4,18-Feb-2023,2023.0,Questionable,ARGENTINA,Patagonia,Chubut Province,,Diego Barría,M,32,Death by misadventure,,,,"El Pais, 2/27/2023"


In [86]:
import pandas as pd
df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")
df['Sex'].unique()

# Expected values: NaN, F, M

In [88]:
import pandas as pd

df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")
df.tail(5)

Unnamed: 0,Date,Year,Type,Country,State,Location,Activity,Name,Sex,Age,Injury,Fatal Y/N,Time,Species,Source
556,1733,1733.0,Invalid,ICELAND,Bardestrand,Talkknefiord,,,,,"Partial hominid remains recovered from shark, ...",,,Shark involvement prior to death unconfirmed,E. Olafsen
557,1723,1723.0,Unprovoked,ROATAN,,,,Philip Ashton,M,,Struck on thigh,,,,"C.Moore, GSAF"
558,Late 1600s Reported 1728,1642.0,Invalid,GUINEA,,,Went overboard,crew member of the Nieuwstadt,M,,FATAL,,,Questionable,"History of the Pyrates, by D. Defoe, Vol. 2, p.28"
559,Before 1824,0.0,Unprovoked,AUSTRALIA,Queensland,Newstead,Swimming,Eullah,F,,Left calf removed,,,,"B. Myatt, GSAF"
560,"No date, Before 1963",0.0,Invalid,USA,Hawaii,"Portlock, Oahu",Diving,Val Valentine,M,,A 4.3 m [14'] shark made threat display. No in...,,,Invalid,B. Sojka & D. Lloyd
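The last rows show why the DATE plan needs a strict parser: entries such as `Before 1824` or a bare `1733` carry no month, and some rows have a `Year` of 0.0. A minimal sketch of the split-and-verify step is given here, assuming that usable dates follow a day, three-letter month, four-digit year layout separated by spaces or hyphens. The sample rows and the helper name `split_dates` are hypothetical.

```python
import pandas as pd

# Hypothetical sample mixing the date styles seen in the raw data.
sample = pd.DataFrame(
    {
        "Date": ["25 Aug 2023", "21 Aug-2023", "07-Jun-2023", "1733", "Before 1824", "18-Feb-2022"],
        "Year": [2023.0, 2023.0, 2023.0, 1733.0, 0.0, 2023.0],
    }
)

# Day, three-letter month and four-digit year, separated by spaces or hyphens.
DATE_PATTERN = r"(?P<Day>\d{1,2})[\s-]+(?P<Month>[A-Za-z]{3})[\s-]+(?P<NewYear>\d{4})"
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def split_dates(frame: pd.DataFrame) -> pd.DataFrame:
    """Split Date into Day/Month/NewYear and keep rows whose years agree."""
    parts = frame["Date"].str.extract(DATE_PATTERN)
    # Rows whose Date does not contain a month and a year get NaN and are dropped.
    out = frame.join(parts).dropna(subset=["Month", "NewYear"]).copy()
    out["Month"] = out["Month"].str.title()
    out = out[out["Month"].isin(MONTHS)].copy()
    out["NewYear"] = out["NewYear"].astype(int)
    # Keep only rows where the year parsed from Date matches the original Year column.
    out = out[out["NewYear"] == out["Year"]]
    # Drop Day and the old Year; the verified year takes over the Year name.
    return out.drop(columns=["Day", "Year"]).rename(columns={"NewYear": "Year"})


dated = split_dates(sample)
month_counts = dated["Month"].value_counts()
print(dated)
print(month_counts)
```

Rows that fail the pattern are removed rather than guessed, which follows the plan of keeping only dates that include both a month and a year.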


In [87]:
import pandas as pd
df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")
df['Fatal Y/N'].unique()

# Expected values: Y, N (in this file the column is entirely NaN)

array([nan])

In [89]:
import pandas as pd
df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")
df.index

RangeIndex(start=0, stop=561, step=1)

In [90]:
df.columns

Index(['Date', 'Year', 'Type', 'Country', 'State', 'Location', 'Activity',
       'Name', 'Sex', 'Age', 'Injury', 'Fatal Y/N', 'Time', 'Species ',
       'Source'],
      dtype='object')

In [91]:
df.columns.to_list()

['Date',
 'Year',
 'Type',
 'Country',
 'State',
 'Location',
 'Activity',
 'Name',
 'Sex',
 'Age',
 'Injury',
 'Fatal Y/N',
 'Time',
 'Species ',
 'Source']
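The column list shows that `'Species '` carries a trailing space, so `df['Species']` would raise a `KeyError`. A minimal sketch of the fix, shown on a hypothetical one-row sample that reuses three of the header names:

```python
import pandas as pd

# Hypothetical one-row sample reusing header names from the list above.
sample = pd.DataFrame(
    [["25 Aug 2023", "White shark", "GSAF"]],
    columns=["Date", "Species ", "Source"],
)

# Strip leading and trailing whitespace from every column name.
sample.columns = sample.columns.str.strip()

print(sample.columns.to_list())
print(sample["Species"].iloc[0])
```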

In [92]:
df.shape

(561, 15)

In [93]:
df.describe().round(2)

Unnamed: 0,Year,Fatal Y/N
count,560.0,0.0
mean,1954.74,
std,128.3,
min,0.0,
25%,1934.75,
50%,1968.0,
75%,2005.0,
max,2023.0,


In [94]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 561 entries, 0 to 560
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Date       561 non-null    object 
 1   Year       560 non-null    float64
 2   Type       557 non-null    object 
 3   Country    556 non-null    object 
 4   State      508 non-null    object 
 5   Location   503 non-null    object 
 6   Activity   433 non-null    object 
 7   Name       502 non-null    object 
 8   Sex        480 non-null    object 
 9   Age        238 non-null    object 
 10  Injury     553 non-null    object 
 11  Fatal Y/N  0 non-null      float64
 12  Time       157 non-null    object 
 13  Species    526 non-null    object 
 14  Source     556 non-null    object 
dtypes: float64(2), object(13)
memory usage: 65.9+ KB
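The `info()` output shows many missing values (for example, `Age` has only 238 non-null entries out of 561 and `Fatal Y/N` has none). A minimal sketch of a missing-value report and a duplicate check follows, on a hypothetical mini-sample; `missing_report` is an illustrative helper, not part of the project's `src` code.

```python
import pandas as pd

# Hypothetical mini-sample; rows 0 and 2 are exact duplicates.
sample = pd.DataFrame(
    {
        "Country": ["USA", None, "USA", "BRAZIL"],
        "Age": ["44", None, "44", None],
        "Time": [None, None, None, "10h00"],
    }
)


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    report = pd.DataFrame(
        {
            "missing": frame.isna().sum(),
            "missing_pct": (frame.isna().mean() * 100).round(1),
        }
    )
    return report.sort_values("missing", ascending=False)


report = missing_report(sample)
print(report)

# duplicated() flags every repeat of an earlier row; missing values compare as equal here.
n_duplicates = int(sample.duplicated().sum())
deduplicated = sample.drop_duplicates()
print(f"{n_duplicates} duplicate row(s) found, {len(deduplicated)} rows left")
```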


In [95]:
import pandas as pd

# Load the CSV
df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")

# Display the entire DataFrame (all rows and columns)
pd.set_option('display.max_rows', None)      # Show all rows
pd.set_option('display.max_columns', None)   # Show all columns

df  # Jupyter will render the full table

Unnamed: 0,Date,Year,Type,Country,State,Location,Activity,Name,Sex,Age,Injury,Fatal Y/N,Time,Species,Source
0,25 Aug 2023,2023.0,Unprovoked,AUSTRALIA,New South Wales,"Lighthouse Beach, Port Macquarie",Surfing,Toby Begg,M,44,Severe injuries to lower limbs,,10h00,"White shark, 3.8-4.2m","B. Myatt, & M. Michaelson, GSAF"
1,21 Aug-2023,2023.0,Questionable,BAHAMAS,New Providence Isoad,"Saunders Beach, Nassau",,male,M,20/30,Body found with shark bites. Possible drowning...,,Morning,,"The Tribune, 8/21/2023"
2,07-Jun-2023,2023.0,Unprovoked,BAHAMAS,Freeport,Shark Junction,Scuba diving,Heidi Ernst,F,73,Calf severely bitten,,13h00,Caribbean rreef shark,"J. Marchand, GSAF"
3,02-Mar-2023,2023.0,Unprovoked,SEYCHELLES,Praslin Island,,Snorkeling,Arthur …,M,6,Left foot bitten,,Afternoon,Lemon shark,"Midlibre, 3/18/2023"
4,18-Feb-2023,2023.0,Questionable,ARGENTINA,Patagonia,Chubut Province,,Diego Barría,M,32,Death by misadventure,,,,"El Pais, 2/27/2023"
5,08-Feb-2022,2022.0,,COSTA RICA,Guanacoste,Playa Del Coco,Diving,female,F,50,Right forearm and left hand injured,,,"Bull shark, 3m",Diario Extra Del Costa Rica 2/9/2022
6,15-Nov-2021,2021.0,Unprovoked,BRAZIL,São Paulo.,Boqueirão Beach,Playing,male,M,11,Minor cuts to left thigh,,12h00,dogfish,"K. McMurray, TrackingSharks.com"
7,16-Oct-2021,2021.0,,AUSTRALIA,Queensland,Sudbury Island,Spearfishing,Torrance Sambo,M,26,Disappeared,,,,"K. McMurray, TrackingSharks.com"
8,10-Sep-2021,2021.0,,EGYPT,,Sidi Abdel Rahmen,Swimming,Mohamed,M,,Laceration to arm caused by metal object,,,No shark invovlement,Dr. M. Fouda & M. Salrm
9,29-Jul-2020,2020.0,Watercraft,AUSTRALIA,Tasmania,Tenth Island,Sightseeing,5.5 m runabout. Occupants: Sean & James Vinar,,,"No injury to occupants, injury to shark attemp...",,09h08,"White shark, 4m","C. Black, GSAF"


In [None]:
import pandas as pd

# Load the CSV
df = pd.read_csv(r"C:\Users\sboub\Documents\GitHub\02_project-shark-attacks\data\raw.csv")


df  # Jupyter will render the table

In [None]:
df['Country'].unique()

In [None]:
df['Sex'].unique()

In [None]:
df.columns.to_list()

In [None]:
df['Location'].unique()

In [None]:
df['Activity'].unique()

In [None]:
df['Age'].unique()

In [None]:
df['Source'].unique()

In [None]:
df['Date'].unique()

In [None]:
df['Year'].unique()

In [None]:
df['Type'].unique()

In [None]:
df['Fatal Y/N'].unique()

In [None]:
df['Time'].unique()

In [None]:
df['Injury'].unique()
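Instead of calling `unique()` column by column, the same inspection can be done in one loop. This is a sketch on a hypothetical sample; `summarize_values` is an illustrative helper and uses `value_counts(dropna=False)` so that missing values are counted too.

```python
import pandas as pd

# Hypothetical sample; with the real data, pass df instead.
sample = pd.DataFrame(
    {
        "Sex": ["M", "F", "M", None],
        "Type": ["Unprovoked", "Provoked", "Unprovoked", "Unprovoked"],
    }
)


def summarize_values(frame: pd.DataFrame, top: int = 5) -> dict:
    """Map each column name to its most frequent values, NaN included."""
    return {
        column: frame[column].value_counts(dropna=False).head(top)
        for column in frame.columns
    }


summary = summarize_values(sample)
for column, counts in summary.items():
    print(f"--- {column} ---")
    print(counts)
```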

## CLEAN DATA
-------

### IMPORT UTILS and INIT from SRC

In [None]:
import os
import sys

# Add the parent directory to the import path so that the `src` package can be imported
sys.path.append(os.path.abspath(".."))
from src import utils
from src import init