## College Football
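Fetch college football game data from the ESPN scoreboard API at "http://site.api.espn.com/apis/site/v2/sports/football/college-football/scoreboard". Each request covers one date, passed as YYYYMMDD in the `dates` query parameter, with the `calendar` query parameter set to "blacklist". The collected games are then filtered down to those involving SEC teams.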
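
As a minimal sketch of how one request URL is put together, the snippet below builds the query string with the standard library instead of string formatting. The helper name `build_scoreboard_url` and its date validation are assumptions added for illustration; only the base URL and the two query parameters come from this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "http://site.api.espn.com/apis/site/v2/sports/football/college-football/scoreboard"


def build_scoreboard_url(game_date: str) -> str:
    """Return the scoreboard URL for one date given as YYYYMMDD (hypothetical helper)."""
    # Reject anything that is not exactly eight digits before building the URL
    if len(game_date) != 8 or not game_date.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {game_date!r}")
    query = urlencode({"dates": game_date, "calendar": "blacklist"})
    return f"{BASE_URL}?{query}"


print(build_scoreboard_url("20240831"))
```

Validating the date up front means a malformed entry fails before any request is sent.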

## Install libraries & get data



In [47]:
# Uncomment if need to install
# %pip install requests

In [48]:
import requests
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [49]:
# Query the ESPN scoreboard API once per game date and collect every game returned
base_url = "http://site.api.espn.com/apis/site/v2/sports/football/college-football/scoreboard"
dates = ["20240829", "20240830", "20240831", "20240901", "20240907", "20240914", "20240921", "20240928", "20241005", "20241012", "20241019", "20241026", "20241102", "20241109", "20241116", "20241123", "20241129", "20241130"]

all_game_data = []

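for specific_date in dates:
    # Request the scoreboard for a single date (YYYYMMDD)
    complete_url = f"{base_url}?dates={specific_date}&calendar=blacklist"
    # A timeout keeps the loop from hanging indefinitely on a stalled connection
    response = requests.get(complete_url, timeout=30)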

    if response.status_code != 200:
        print(f"Error fetching data for {specific_date}: {response.status_code}")
    else:
        data = response.json()
        if 'events' in data:
            for event in data['events']:
                game_info = {
                    'game_id': event.get('id'),
                    'date': event.get('date'),
                    'status': event.get('status', {}).get('type', {}).get('detail'),
                    'teams': [],
                    'scores': []
                }
                if 'competitions' in event:
                    for competition in event['competitions']:
                        if 'competitors' in competition:
                            for competitor in competition['competitors']:
                                game_info['teams'].append(competitor.get('team', {}).get('displayName'))
                                game_info['scores'].append(competitor.get('score'))
                all_game_data.append(game_info)

df_all_game_data = pd.DataFrame(all_game_data)
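
The event-parsing logic is needed in more than one cell, so it could be factored into a single function. The sketch below is one way to do that, assuming a hypothetical helper name `parse_event`. It reads only the fields already used in this notebook (`id`, `date`, `status.type.detail`, `competitions`, `competitors`, `team.displayName`, `score`). The `sample_event` payload is hand-written for illustration from values shown in this notebook's output and is not a real API response.

```python
def parse_event(event: dict) -> dict:
    """Flatten one scoreboard event into the record format used in this notebook."""
    game_info = {
        "game_id": event.get("id"),
        "date": event.get("date"),
        "status": event.get("status", {}).get("type", {}).get("detail"),
        "teams": [],
        "scores": [],
    }
    # Missing keys fall back to empty lists, so no explicit membership checks are needed
    for competition in event.get("competitions", []):
        for competitor in competition.get("competitors", []):
            game_info["teams"].append(competitor.get("team", {}).get("displayName"))
            game_info["scores"].append(competitor.get("score"))
    return game_info


# Hand-written sample shaped like the fields above (an assumption, not a real response)
sample_event = {
    "id": "401628327",
    "date": "2024-08-30T00:00Z",
    "status": {"type": {"detail": "Final"}},
    "competitions": [
        {
            "competitors": [
                {"team": {"displayName": "Missouri Tigers"}, "score": "51"},
                {"team": {"displayName": "Murray State Racers"}, "score": "0"},
            ]
        }
    ],
}

print(parse_event(sample_event))
```

With such a helper, the collection loop reduces to appending `parse_event(event)` for each event in `data['events']`.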

## Process data


In [50]:
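# Preview a single day: `response` still holds the last request made in the loop above
# (the final entry in `dates`), so this cell re-parses only that date's games.
if response.status_code != 200: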
    print(f"Error fetching data: {response.status_code}")
    game_data = []
else:
    data = response.json()
    game_data = []
    if 'events' in data:
        for event in data['events']:
            game_info = {
                'game_id': event.get('id'),
                'date': event.get('date'),
                'status': event.get('status', {}).get('type', {}).get('detail'),
                'teams': [],
                'scores': []
            }
            if 'competitions' in event:
                for competition in event['competitions']:
                    if 'competitors' in competition:
                        for competitor in competition['competitors']:
                            game_info['teams'].append(competitor.get('team', {}).get('displayName'))
                            game_info['scores'].append(competitor.get('score'))
            game_data.append(game_info)

df_game_data = pd.DataFrame(game_data)
display(df_game_data.head())

Unnamed: 0,game_id,date,status,teams,scores
0,401628569,2024-12-01T00:30Z,Final,"[Oregon Ducks, Washington Huskies]","[49, 21]"
1,401628566,2024-11-30T17:00Z,Final,"[Ohio State Buckeyes, Michigan Wolverines]","[10, 13]"
2,401628445,2024-12-01T00:30Z,Final,"[Texas A&M Aggies, Texas Longhorns]","[7, 17]"
3,401628565,2024-11-30T20:30Z,Final,"[Penn State Nittany Lions, Maryland Terrapins]","[44, 7]"
4,401628571,2024-11-30T20:30Z,Final,"[USC Trojans, Notre Dame Fighting Irish]","[35, 49]"


In [51]:
display(df_game_data)

Unnamed: 0,game_id,date,status,teams,scores
0,401628569,2024-12-01T00:30Z,Final,"[Oregon Ducks, Washington Huskies]","[49, 21]"
1,401628566,2024-11-30T17:00Z,Final,"[Ohio State Buckeyes, Michigan Wolverines]","[10, 13]"
2,401628445,2024-12-01T00:30Z,Final,"[Texas A&M Aggies, Texas Longhorns]","[7, 17]"
3,401628565,2024-11-30T20:30Z,Final,"[Penn State Nittany Lions, Maryland Terrapins]","[44, 7]"
4,401628571,2024-11-30T20:30Z,Final,"[USC Trojans, Notre Dame Fighting Irish]","[35, 49]"
5,401635623,2024-11-30T20:30Z,Final,"[Syracuse Orange, Miami Hurricanes]","[42, 38]"
6,401628446,2024-11-30T17:00Z,Final,"[Vanderbilt Commodores, Tennessee Volunteers]","[23, 36]"
7,401635621,2024-11-30T20:30Z,Final,"[SMU Mustangs, California Golden Bears]","[38, 6]"
8,401628564,2024-12-01T00:00Z,Final,"[Indiana Hoosiers, Purdue Boilermakers]","[66, 0]"
9,401628444,2024-11-30T17:00Z,Final,"[Clemson Tigers, South Carolina Gamecocks]","[14, 17]"


In [52]:
num_games = len(df_all_game_data)
print(f"The total number of games collected is: {num_games}")

The total number of games collected is: 790


## Filter for SEC teams only

In [53]:
sec_teams = [
    "Alabama Crimson Tide", "Arkansas Razorbacks", "Auburn Tigers", "Florida Gators",
    "Georgia Bulldogs", "Kentucky Wildcats", "LSU Tigers", "Mississippi State Bulldogs",
    "Missouri Tigers", "Oklahoma Sooners", "Ole Miss Rebels", "South Carolina Gamecocks",
    "Tennessee Volunteers", "Texas A&M Aggies", "Texas Longhorns", "Vanderbilt Commodores"
]

def is_sec_game(teams):
    """Return True if at least one team in the game is an SEC team."""
    return any(team in sec_teams for team in teams)

sec_games_df = df_all_game_data[df_all_game_data['teams'].apply(is_sec_game)].copy()

display(sec_games_df.head())

Unnamed: 0,game_id,date,status,teams,scores
0,401628327,2024-08-30T00:00Z,Final,"[Missouri Tigers, Murray State Racers]","[51, 0]"
11,401628320,2024-08-29T23:30Z,Final,"[Arkansas Razorbacks, Arkansas-Pine Bluff Gold...","[70, 0]"
22,401628328,2024-08-30T23:00Z,Final,"[Oklahoma Sooners, Temple Owls]","[51, 3]"
28,401628323,2024-08-31T16:00Z,Final,"[Georgia Bulldogs, Clemson Tigers]","[34, 3]"
31,401628331,2024-08-31T19:30Z,Final,"[Texas Longhorns, Colorado State Rams]","[52, 0]"


In [54]:
num_sec_games = len(sec_games_df)
print(f"The total number of SEC games collected is: {num_sec_games}")

The total number of SEC games collected is: 128
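
Because `is_sec_game` returns `True` when either team is in the SEC, the 128 games include non-conference matchups such as Missouri vs. Murray State. If conference-only games are of interest, one possible approach is to count how many of a game's teams are SEC members. The helper names `count_sec_teams` and `is_conference_game` below are hypothetical and not part of the original notebook.

```python
SEC_TEAMS = {
    "Alabama Crimson Tide", "Arkansas Razorbacks", "Auburn Tigers", "Florida Gators",
    "Georgia Bulldogs", "Kentucky Wildcats", "LSU Tigers", "Mississippi State Bulldogs",
    "Missouri Tigers", "Oklahoma Sooners", "Ole Miss Rebels", "South Carolina Gamecocks",
    "Tennessee Volunteers", "Texas A&M Aggies", "Texas Longhorns", "Vanderbilt Commodores",
}


def count_sec_teams(teams):
    """Return how many of the listed teams are SEC members."""
    return sum(team in SEC_TEAMS for team in teams)


def is_conference_game(teams):
    """Return True only when both teams in a two-team game are SEC members."""
    return len(teams) == 2 and count_sec_teams(teams) == 2


print(is_conference_game(["Texas A&M Aggies", "Texas Longhorns"]))
print(is_conference_game(["Missouri Tigers", "Murray State Racers"]))
```

Applied to the `teams` column with `.apply(is_conference_game)`, this would give a boolean mask for SEC-vs-SEC games only.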


## Create a model (binary classification)
Analyze SEC college football data to predict winners and losers using logistic regression. The first step is to prepare the data for modeling.

In [55]:
sec_games_processed_df = pd.DataFrame()
sec_games_processed_df['team1'] = sec_games_df['teams'].apply(lambda x: x[0])
sec_games_processed_df['team2'] = sec_games_df['teams'].apply(lambda x: x[1])
sec_games_processed_df['score1'] = sec_games_df['scores'].apply(lambda x: int(x[0]))
sec_games_processed_df['score2'] = sec_games_df['scores'].apply(lambda x: int(x[1]))

sec_games_processed_df['team1_won'] = (sec_games_processed_df['score1'] > sec_games_processed_df['score2']).astype(int)
sec_games_processed_df['score_difference'] = abs(sec_games_processed_df['score1'] - sec_games_processed_df['score2'])

# Caution: score1 and score2 are final scores, so together they fully determine team1_won
features = sec_games_processed_df[['score1', 'score2', 'score_difference']]
target = sec_games_processed_df['team1_won']

display(sec_games_processed_df.head())
display(features.head())
display(target.head())

Unnamed: 0,team1,team2,score1,score2,team1_won,score_difference
0,Missouri Tigers,Murray State Racers,51,0,1,51
11,Arkansas Razorbacks,Arkansas-Pine Bluff Golden Lions,70,0,1,70
22,Oklahoma Sooners,Temple Owls,51,3,1,48
28,Georgia Bulldogs,Clemson Tigers,34,3,1,31
31,Texas Longhorns,Colorado State Rams,52,0,1,52


Unnamed: 0,score1,score2,score_difference
0,51,0,51
11,70,0,70
22,51,3,48
28,34,3,31
31,52,0,52


Unnamed: 0,team1_won
0,1
11,1
22,1
28,1
31,1


In [56]:
# Split Data
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

print("Training features shape:", X_train.shape)
print("Testing features shape:", X_test.shape)
print("Training target shape:", y_train.shape)
print("Testing target shape:", y_test.shape)

Training features shape: (102, 3)
Testing features shape: (26, 3)
Training target shape: (102,)
Testing target shape: (26,)


In [57]:
# Train Model
model = LogisticRegression()
model.fit(X_train, y_train)

In [58]:
# Evaluate Model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the logistic regression model on the test data: {accuracy:.4f}")

Accuracy of the logistic regression model on the test data: 1.0000
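
A perfect score is expected given these features. The target `team1_won` is defined as `score1 > score2`, so a linear decision boundary on `score1 - score2` separates the classes exactly. The sketch below illustrates this with hand-picked weights, which are an assumption for illustration and not the fitted model's coefficients. The score pairs are copied from the game tables shown in this notebook.

```python
import math

# (score1, score2) pairs copied from the game tables shown in this notebook
games = [(51, 0), (70, 0), (51, 3), (34, 3), (52, 0), (23, 36), (7, 17), (14, 17)]


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Assumed, hand-picked weights: +1 on score1, -1 on score2, no bias
W_SCORE1, W_SCORE2, BIAS = 1.0, -1.0, 0.0

labels = [int(s1 > s2) for s1, s2 in games]
probs = [sigmoid(W_SCORE1 * s1 + W_SCORE2 * s2 + BIAS) for s1, s2 in games]
preds = [int(p > 0.5) for p in probs]

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(f"Accuracy with hand-picked weights: {accuracy:.4f}")
```

Since no training is needed to reach this result, the accuracy says nothing about how well winners can be predicted before a game is played.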


## Predict winners and losers


In [59]:
# Build future_games_df, starting from games whose status is 'Scheduled'
future_games = df_all_game_data[df_all_game_data['status'] == 'Scheduled'].copy()

if future_games.empty:
    print("No scheduled future games found in the scraped data. Creating a sample future games DataFrame.")
    # Create a sample DataFrame if no future games are available
    future_games_data = {
        'score1': [21, 35, 14],
        'score2': [17, 42, 28],
        'score_difference': [4, 7, 14]
    }
    future_games_df = pd.DataFrame(future_games_data)
else:
    print(f"Found {len(future_games)} scheduled future games.")
    # Process the future games data similarly to how training data was processed
    future_games_processed = pd.DataFrame()
    future_games_processed['team1'] = future_games['teams'].apply(lambda x: x[0])
    future_games_processed['team2'] = future_games['teams'].apply(lambda x: x[1])
    print("The trained model uses score1, score2, and score_difference, which are not available for future games.")
    print("Creating a sample future_games_df for demonstration purposes with dummy feature values.")

    sample_future_games_data = {
        'score1': [21, 35, 14, 45],
        'score2': [17, 42, 28, 21],
        'score_difference': [4, 7, 14, 24]
    }
    future_games_df = pd.DataFrame(sample_future_games_data)


# Predict probability of team1 winning
predicted_probabilities = model.predict_proba(future_games_df)[:, 1]

# Predict the winner
predicted_winners = model.predict(future_games_df)

# Add the predicted probabilities and the predicted winner to future_games_df
future_games_df['predicted_team1_win_prob'] = predicted_probabilities
future_games_df['predicted_winner_class'] = predicted_winners

# Display predictions
display(future_games_df)

No scheduled future games found in the scraped data. Creating a sample future games DataFrame.


Unnamed: 0,score1,score2,score_difference,predicted_team1_win_prob,predicted_winner_class
0,21,17,4,0.982202,1
1,35,42,7,0.004504,0
2,14,28,14,9e-06,0


## Summary:

### Key Findings

*   The analysis focused on SEC regular season college football games in 2024.
*   A total of 128 games involving at least one of the 16 SEC teams were identified among the 790 games collected.
*   A logistic regression model was trained using game data features: `score1`, `score2`, and `score_difference`.
*   The trained logistic regression model achieved an accuracy of 100% on the test dataset.
*   No scheduled games were found in the collected data, and scores would not be available for such games in any case, so predictions were demonstrated using a sample dataset with dummy feature values.

### Next Steps

*   The 100% accuracy on the test set reflects target leakage rather than predictive skill: `team1_won` is computed directly from `score1` and `score2`, which are known only after a game ends, so the result does not carry over to future games where scores are unknown.
*   Future steps should involve training a model on features available before the game starts (e.g., team rankings, historical performance, home advantage, and possibly player injuries) to make meaningful predictions for scheduled games.
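
As a rough sketch of what a pre-game feature could look like, the snippet below computes each team's win rate from games played before the current one, so no information from the game being predicted leaks into its features. The function name `add_pregame_win_rates`, the 0.5 default for teams with no prior games, and the placeholder teams and scores are all assumptions for illustration and are not data from this notebook.

```python
from collections import defaultdict


def add_pregame_win_rates(games, default_rate=0.5):
    """Attach each team's win rate over earlier games to every game record."""
    wins = defaultdict(int)
    played = defaultdict(int)
    rows = []
    # ISO-formatted date strings sort chronologically
    for game in sorted(games, key=lambda g: g["date"]):
        t1, t2 = game["team1"], game["team2"]
        # Features use only results recorded before this game
        rate1 = wins[t1] / played[t1] if played[t1] else default_rate
        rate2 = wins[t2] / played[t2] if played[t2] else default_rate
        rows.append({
            "team1": t1,
            "team2": t2,
            "team1_prior_win_rate": rate1,
            "team2_prior_win_rate": rate2,
            "team1_won": int(game["score1"] > game["score2"]),
        })
        # Update the running records only after the features are stored
        played[t1] += 1
        played[t2] += 1
        if game["score1"] > game["score2"]:
            wins[t1] += 1
        elif game["score2"] > game["score1"]:
            wins[t2] += 1
    return rows


# Hypothetical placeholder games, not real results
sample_games = [
    {"date": "2024-09-01", "team1": "Team A", "team2": "Team B", "score1": 21, "score2": 14},
    {"date": "2024-09-08", "team1": "Team A", "team2": "Team C", "score1": 10, "score2": 17},
    {"date": "2024-09-15", "team1": "Team B", "team2": "Team C", "score1": 28, "score2": 24},
]

for row in add_pregame_win_rates(sample_games):
    print(row)
```

The prior win rates could replace `score1`, `score2`, and `score_difference` as model inputs, and the same values can be computed for scheduled games because they depend only on past results.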
