# RFP: Betting on the Bachelor

## Project Overview
You are invited to submit a proposal that answers the following question:

### Who will win season 29 of the Bachelor?

*All proposals must be submitted by **1/15/25 at 11:59 PM**.*

## Required Proposal Components

### 1. Data Description
In the code cell below, read in the data you plan on using to train and test your model. Call `info()` once you have read the data into a dataframe. Consider using some or all of the following sources:
- [Scrape Fandom Wikis](https://bachelor-nation.fandom.com/wiki/The_Bachelor) or [the official Bachelor website](https://bachelornation.com/shows/the-bachelor/)
- [Ask ChatGPT to generate it](https://chatgpt.com/)
- [Read in csv files like this](https://www.kaggle.com/datasets/brianbgonz/the-bachelor-contestants?select=contestants.csv)

*Note, a level 5 dataset contains at least 1000 rows of non-null data. A level 4 contains at least 500 rows of non-null data.*
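
One way to check a file against these row thresholds before training is sketched below. It uses only the standard library; the helper names `count_complete_rows` and `dataset_level` and the three-row sample are hypothetical, and "non-null" is assumed here to mean that no field in the row is empty.

```python
import csv
import io


def count_complete_rows(csv_text):
    """Count data rows in which every field is present and non-empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # DictReader fills missing trailing fields with None, hence `v or ""`.
    return sum(
        1 for row in reader
        if all((v or "").strip() for v in row.values())
    )


def dataset_level(n_complete):
    """Map a complete-row count to the rubric level (None if below level 4)."""
    if n_complete >= 1000:
        return 5
    if n_complete >= 500:
        return 4
    return None


# Made-up sample for illustration only: the second row is missing its Age.
sample_csv = (
    "Name,Age,Hometown\n"
    'A,23,"Chanute, Kansas"\n'
    'B,,"Dallas, Texas"\n'
    'C,24,"Miami, Florida"\n'
)

n = count_complete_rows(sample_csv)
print(n, dataset_level(n))
```

With a dataframe already loaded, `len(df.dropna())` should give the same kind of count.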

In [3]:
import pandas as pd
from bs4 import BeautifulSoup as BS
import math
import requests
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [98]:
df = pd.read_csv('contestants_level4.csv')

In [100]:
X = df[['Age','Region']]
y = df['ElimWeek']

In [102]:
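# Note: this file has no 'Number' column, so the rename below changes nothing
# (the info() output still lists 'Unnamed: 0').
df.rename(columns={'Number': 'Unnamed:0'}, inplace=True)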

In [104]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  500 non-null    int64  
 1   Name        500 non-null    object 
 2   Age         500 non-null    float64
 3   Hometown    500 non-null    object 
 4   ElimWeek    500 non-null    float64
 5   Season      500 non-null    int64  
 6   Region      500 non-null    int64  
dtypes: float64(2), int64(3), object(2)
memory usage: 27.5+ KB


In [106]:
df

| | Unnamed: 0 | Name | Age | Hometown | ElimWeek | Season | Region |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Amanda Marsh | 23.0 | Chanute, Kansas | 11.0 | 1 | 4 |
| 1 | 1 | Trista Rehn | 29.0 | Miami, Florida | 6.0 | 1 | 2 |
| 2 | 2 | Shannon Oliver | 24.0 | Dallas, Texas | 5.0 | 1 | 2 |
| 3 | 3 | Kim | 24.0 | Tempe, Arizona | 4.0 | 1 | 4 |
| 4 | 4 | Cathy Grimes | 22.0 | Terra Haute, Indiana | 3.0 | 1 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | 20 | Kristina | 27.0 | Brooklyn, New York | 1.0 | 7 | 3 |
| 496 | 21 | Kristine | 23.0 | Fairfax Station, Virginia | 1.0 | 7 | 2 |
| 497 | 22 | Kyshawn | 30.0 | Nashville, Tennessee | 1.0 | 7 | 2 |
| 498 | 23 | Siomara | 25.0 | Chicago, Illinois | 1.0 | 7 | 4 |


In [108]:
df['Season'].unique()

array([ 1,  2,  5,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,  4,
        6,  7], dtype=int64)

### 2. Training Your Model
In the cell seen below, write the code you need to train a linear regression model. Make sure you display the equation of the plane that best fits your chosen data at the end of your program. 

*Note, level 5 work trains a model using only the standard Python library and Pandas. A level 5 model is trained with at least two features, where one of the features begins as a categorical value (e.g. occupation, hometown, etc.). A level 4 uses external libraries like scikit-learn or NumPy.*
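
As a rough illustration of what training without external libraries could look like, the sketch below one-hot encodes a categorical feature and fits ordinary least squares by solving the normal equations with Gauss-Jordan elimination. This is only one possible approach, not a required method. The helper names `one_hot` and `fit_linear_regression` are hypothetical, and the ages, regions, and targets are made-up numbers constructed so the expected coefficients are known in advance.

```python
def one_hot(values):
    """Turn category labels into 0/1 indicator columns.

    The first category in sorted order is dropped so the indicator
    columns stay linearly independent of the intercept column.
    """
    categories = sorted(set(values))[1:]
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


def fit_linear_regression(features, targets):
    """Least squares via the normal equations (X^T X) b = X^T y.

    Returns [intercept, coef_1, coef_2, ...].
    """
    rows = [[1.0] + list(r) for r in features]  # prepend intercept column
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot solve")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


# Made-up data: target = 10 - 0.2*age + 1*(South) + 2*(West).
ages = [22, 25, 28, 30, 24, 27]
regions = ["South", "West", "South", "West", "Midwest", "Midwest"]
encoded = one_hot(regions)  # columns: South, West (Midwest is the baseline)
features = [[a] + e for a, e in zip(ages, encoded)]
targets = [10 - 0.2 * a + 1 * e[0] + 2 * e[1] for a, e in zip(ages, encoded)]

coefs = fit_linear_regression(features, targets)
print([round(c, 3) for c in coefs])
```

One-hot columns are used here instead of integer codes because integer codes impose an ordering on the categories that a label such as a hometown does not naturally have.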

In [111]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

In [113]:
model = LinearRegression()
model.fit(X_train, y_train)

In [115]:
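# coef_ holds one coefficient per feature, in the column order of X: [Age, Region]
slope = model.coef_[0]  # Age coefficient only; model.coef_[1] (Region) is not printed
intercept = model.intercept_
print(slope, intercept)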

-0.06578016038585982 5.769421903003378


In [117]:
X_test

| | Age | Region |
|---|---|---|
| 361 | 25.0 | 2 |
| 73 | 26.0 | 3 |
| 374 | 25.0 | 2 |
| 155 | 26.0 | 4 |
| 104 | 25.0 | 3 |
| ... | ... | ... |
| 266 | 28.0 | 4 |
| 23 | 29.0 | 3 |
| 222 | 24.0 | 4 |
| 261 | 27.0 | 2 |


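## y = -0.066x + 5.77
### x is Age and y is ElimWeek; these are the rounded Age coefficient and intercept printed above, and the Region coefficient is not shown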
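
Because the model was fit on two features (Age and Region), the full plane has one coefficient per feature plus an intercept. Below is a minimal sketch of formatting such an equation from a list of coefficients. The helper `format_plane` is hypothetical, and the numbers passed to it are placeholders chosen for illustration, not values from the fitted model.

```python
def format_plane(intercept, coefs, names, target="y"):
    """Build a readable equation string: target = b0 + b1*x1 + b2*x2 ..."""
    parts = [f"{intercept:.3f}"]
    for coef, name in zip(coefs, names):
        sign = "+" if coef >= 0 else "-"
        parts.append(f"{sign} {abs(coef):.3f}*{name}")
    return f"{target} = " + " ".join(parts)


# Placeholder values for illustration only (not the fitted coefficients).
equation = format_plane(5.5, [-0.07, 0.25], ["Age", "Region"], target="ElimWeek")
print(equation)  # ElimWeek = 5.500 - 0.070*Age + 0.250*Region
```

With the scikit-learn model from this section, the arguments would presumably be `model.intercept_`, `model.coef_`, and `list(X.columns)`.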

### 3. Testing Your Model
In the cell seen below, write the code you need to test your linear regression model. 

*Note, a model is considered a level 5 if it achieves at least 60% prediction accuracy or achieves an RMSE of 2 weeks or less.*
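
Both metrics named in the note can be computed with the standard library alone. The sketch below uses hypothetical helper names and made-up numbers. It also shows that the accuracy figure depends on how a continuous prediction is turned into a whole week: rounding to the nearest week and truncating with `int()` can give different results.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target (weeks)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def exact_match_accuracy(actual, predicted, to_week=round):
    """Share of predictions that land on the actual week after conversion."""
    hits = sum(1 for a, p in zip(actual, predicted) if to_week(p) == a)
    return hits / len(actual)


# Made-up values for illustration only.
actual = [1, 3, 5, 2]
predicted = [1.4, 2.6, 3.0, 2.2]

print(rmse(actual, predicted))                       # about 1.04
print(exact_match_accuracy(actual, predicted))       # 0.75 with round()
print(exact_match_accuracy(actual, predicted, int))  # 0.5 with int() truncation
```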

In [121]:
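predictions = model.predict(X_test)
# int() truncates toward zero (4.9 becomes 4); round() would pick the nearest week instead
predictions = [int(x) for x in predictions]
# RMSE in weeks, computed on the truncated predictions
math.sqrt(mean_squared_error(y_test, predictions))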

2.842534080710379

In [123]:
y_test = list(y_test)

In [125]:
count = 0
for i in range(len(predictions)):
    if predictions[i] == y_test[i]:
        count = count + 1

print(count/len(predictions))

0.1


### 4. Final Answer

In the first cell seen below, state the name of your predicted winner. 
In the second cell seen below, justify your prediction using an evaluation technique like RMSE or percent accuracy.

#### My model predicts Alli Jo Hinkes to win

#### My model has an RMSE of 2.8 weeks, meaning its predicted elimination week is typically off by close to three weeks. That is a large error for a show that lasts about nine weeks, so naming a winner from it is a bit of a stretch. The model picked Alli, so she is my prediction, but you should be cautious: only 10% of its test predictions matched the actual elimination week, so this estimate is very likely to be inaccurate. This may be because emotions and personalities can't be captured by data such as a contestant's age and region, which are the features I used to train my model.