# Linear Regression
Implement a linear regression model to predict upcoming shark attacks using data from the shark attack dataset at https://www.kaggle.com/teajay/global-shark-attacks.

Collaborators:
1. Srujana Devulapally
2. Jayesh Kaushik
3. Sela Grace Koshy

1. First, you'll need to import the necessary libraries and load the dataset into a Pandas dataframe:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset into a Pandas dataframe
shark_attacks_df = pd.read_csv('shark_attacks.csv')
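
Before dropping or mapping columns by name, it can help to confirm that the headers in the file match the names used later. The sketch below uses an invented two-row CSV whose headers carry stray spaces; that the real `shark_attacks.csv` has such headers is an assumption, not something the notebook confirms.

```python
import io
import pandas as pd

# Invented two-row CSV with stray spaces in the headers (an assumption about
# what a raw export might look like, not a copy of the real file).
raw_csv = io.StringIO('Case Number,Sex ,Fatal (Y/N), Age\n1,M,N,23\n2,F,Y,41\n')
toy_df = pd.read_csv(raw_csv)

# Normalize header whitespace so later lookups by name do not raise KeyError.
toy_df.columns = toy_df.columns.str.strip()

# List any expected column that is still absent after the cleanup.
expected = ['Sex', 'Fatal (Y/N)', 'Age']
missing = [name for name in expected if name not in toy_df.columns]
print(toy_df.columns.tolist(), missing)
```

The same two lines (`str.strip()` on the columns, then the `missing` check) could be run on `shark_attacks_df` right after loading.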

2. Preprocess the data by dropping irrelevant columns, handling missing values, and converting categorical variables to numerical variables as needed.

In [2]:
# Drop rows with missing values into a separate copy; the steps below keep working on shark_attacks_df
shark_attacks_df_clean = shark_attacks_df.dropna()

# Drop unnecessary columns
shark_attacks_df.drop(['Case Number', 'Date', 'Year', 'Type', 'Country', 'Area', 'Location', 'Name', 'Injury', 'Time', 'Species', 'Investigator or Source'], axis=1, inplace=True)

# Convert categorical variables to numerical ones
shark_attacks_df['Sex'] = shark_attacks_df['Sex'].map({'M': 1, 'F': 0})
shark_attacks_df['Fatal (Y/N)'] = shark_attacks_df['Fatal (Y/N)'].map({'Y': 1, 'N': 0})

# Insert the 'Attacks' target column; 'Fatal (Y/N)' holds 1/0 after the mapping above, so compare against 1
shark_attacks_df.insert(0, 'Attacks', shark_attacks_df['Fatal (Y/N)'].apply(lambda x: 1 if x == 1 else 0))

# Drop rows that still contain missing values, including values the mappings could not convert
shark_attacks_df.dropna(inplace=True)
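
`Series.map` returns NaN for every value that is not a key in its dictionary, so a later `dropna` can remove more rows than expected. The sketch below shows the effect on a toy frame; the rows are invented for illustration, and the target is derived from the already-mapped 1/0 values.

```python
import pandas as pd

# Toy stand-in for the real data; these rows are invented for illustration.
toy_df = pd.DataFrame({
    'Sex': ['M', 'F', 'M', None, 'M'],
    'Fatal (Y/N)': ['Y', 'N', 'UNKNOWN', 'N', 'Y'],
})

# map() turns any value missing from the dictionary into NaN.
toy_df['Sex'] = toy_df['Sex'].map({'M': 1, 'F': 0})
toy_df['Fatal (Y/N)'] = toy_df['Fatal (Y/N)'].map({'Y': 1, 'N': 0})

rows_before = len(toy_df)
toy_clean = toy_df.dropna().copy()
rows_after = len(toy_clean)

# Derive the target from the mapped column, whose values are now 1/0.
toy_clean['Attacks'] = (toy_clean['Fatal (Y/N)'] == 1).astype(int)

print(f'rows before dropna: {rows_before}, after: {rows_after}')
print(toy_clean)
```

Comparing `rows_before` and `rows_after` on the real dataframe is a quick way to see how much data survives preprocessing.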

3. Split the dataset into training and testing sets:

In [3]:
# split the data into training and testing sets
train_size = int(len(shark_attacks_df) * 0.7)
train_set = shark_attacks_df[:train_size]
test_set = shark_attacks_df[train_size:]
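
The slice-based split keeps the file's row order, so if the file happens to be sorted (for example by date, which is an assumption here) the test set comes entirely from one end of it. A minimal sketch of a shuffled 70/30 split follows; `shuffled_split` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def shuffled_split(df, train_fraction=0.7, seed=0):
    """Hypothetical helper: shuffle row positions, then cut at train_fraction."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))          # random ordering of row positions
    train_size = int(len(df) * train_fraction)
    return df.iloc[order[:train_size]], df.iloc[order[train_size:]]

# Invented ten-row frame, used only to show the resulting sizes.
toy_df = pd.DataFrame({'Attacks': [0, 1] * 5, 'Sex': [1, 0] * 5})
train_set, test_set = shuffled_split(toy_df)
print(len(train_set), len(test_set))
```

Fixing `seed` keeps the split reproducible between runs.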

4. Define the feature matrix X and target vector y for both the training and testing sets:

In [4]:
# define the feature matrix X and target vector y for both training and testing sets
X_train = np.c_[np.ones((len(train_set), 1)), train_set.drop(['Attacks'], axis=1)]
y_train = np.array(train_set['Attacks']).reshape(-1, 1)
X_test = np.c_[np.ones((len(test_set), 1)), test_set.drop(['Attacks'], axis=1)]
y_test = np.array(test_set['Attacks']).reshape(-1, 1)
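
The shape and dtype of the design matrix are worth checking before any linear algebra, since columns read from a CSV are often stored as text. The sketch below builds `X` with a leading column of ones from a toy frame; the `Age` column and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented rows; 'Age' is stored as text, as CSV columns often are.
toy_set = pd.DataFrame({
    'Attacks': [1, 0, 1, 0],
    'Sex': [1, 0, 1, 1],
    'Age': ['23', '41', '17', '35'],
})

# Convert the features to float64 before stacking, so X is a numeric array.
features = toy_set.drop(['Attacks'], axis=1).astype('float64').to_numpy()

# Prepend the bias column of ones, giving one row per sample.
X = np.hstack([np.ones((len(toy_set), 1)), features])
y = toy_set['Attacks'].to_numpy(dtype='float64').reshape(-1, 1)

print(X.shape, X.dtype, y.shape)
```

If `X.dtype` came out as `object` instead, matrix products would fall back to slow Python-level arithmetic or fail outright.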


5. Implement the normal equation method to solve for the optimal values of theta:

In [12]:
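# implement the normal equation method: theta = pinv(X^T X) X^T y
X_train = X_train.astype('float64')
X_test = X_test.astype('float64')  # cast the test features too, so predictions are computed in float64
theta = np.linalg.pinv(X_train.T.dot(X_train)).dot(X_train.T).dot(y_train)

# Alternative with a plain inverse; it raises LinAlgError when X_train.T.dot(X_train) is singular
#theta = np.linalg.inv(X_train.T.dot(X_train)).dot(X_train.T).dot(y_train)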
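
One way to sanity-check the normal equation is to run it on synthetic data with known coefficients. The sketch below does that and compares the result with `np.linalg.lstsq`; the data and `true_theta` are made up for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Synthetic design matrix: a bias column plus one feature drawn from [0, 10).
x = rng.uniform(0, 10, size=(n, 1))
X = np.hstack([np.ones((n, 1)), x])

# Noise-free targets, so an exact fit should recover true_theta.
true_theta = np.array([[2.0], [0.5]])
y = X.dot(true_theta)

# Normal equation with the pseudo-inverse, as in the cell above.
theta_pinv = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)

# Least-squares solver applied to the same problem, for comparison.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_pinv.ravel(), theta_lstsq.ravel())
```

Both routes should agree up to floating-point error. `lstsq` avoids forming `X.T.dot(X)` explicitly, which tends to be better conditioned numerically.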

6. Use the trained model to make predictions on the testing set:

In [13]:
# use the trained model to make predictions on the testing set
y_pred = X_test.dot(theta)

7. Calculate the root mean squared error (RMSE) to evaluate the performance of the model:

In [15]:
# calculate the root mean squared error (RMSE)
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(f'RMSE: {rmse}')

RMSE: nan
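
`np.mean` returns nan for an empty array and for any array that contains a nan, so a nan RMSE usually points to an empty test set or to nan values in the targets or predictions. The output alone does not show which applies here. The sketch below makes the cause explicit; `rmse_checked` is a hypothetical helper.

```python
import numpy as np

def rmse_checked(y_true, y_pred):
    """Hypothetical helper: RMSE that reports why the result would be nan."""
    y_true = np.asarray(y_true, dtype='float64')
    y_pred = np.asarray(y_pred, dtype='float64')
    if y_true.size == 0:
        raise ValueError('no test rows: the mean of an empty array is nan')
    if np.isnan(y_true).any() or np.isnan(y_pred).any():
        raise ValueError('nan values present in targets or predictions')
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Two invented samples, each off by 0.5 in opposite directions.
print(rmse_checked([[1.0], [0.0]], [[0.5], [0.5]]))
```

Calling `rmse_checked(y_test, y_pred)` in place of the inline formula would raise an error that names the cause instead of printing nan.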
