# Let's Engineer the Features!

First we should load the data.

In [2]:
import pandas as pd

# Load the data
data = pd.read_csv('csv/data_cleaned.csv')

# Display the first few rows of the dataframe
data.head()

Unnamed: 0,Date,EventCode,ActionGeo_FullName,ActionGeo_Lat,ActionGeo_Long,AvgTone
0,2024-08-23,145,"Union Park, Illinois, United States",41.8839,-87.6648,-3.046968
1,2024-08-22,145,"Union Park, Illinois, United States",41.8839,-87.6648,0.0
2,2024-08-20,145,"Union Park, Illinois, United States",41.8839,-87.6648,-4.319654
3,2024-08-20,145,"Union Park, Illinois, United States",41.8839,-87.6648,-4.319654
4,2024-06-27,145,"Buckingham Fountain, Illinois, United States",41.8756,-87.6189,-7.052186


---

Let's create our feature set. We've already engineered our labels.

---

### **Features as a Tensor**

The input data for the model is represented as a 3D tensor:

$$
\mathbf{X} \in \mathbb{R}^{B \times T \times F}
$$

Where:
- $ B $: Number of unique target dates in the label dataset that have at least $ T = 30 $ predecessor events.
- $ T = 30 $: Length of the sequence (time steps).
- $ F = 5 $: Number of features per time step.

---

### **Feature Definition**

For each row in the sequence:
$$
\mathbf{x}_t = [\text{event\_date}_t, \text{lat}_t, \text{long}_t, \text{score}_t, \text{target\_date}]
$$

- $ \text{event\_date}_t $: Date of the historical event at time step $ t $.
- $ \text{lat}_t, \text{long}_t $: Latitude and longitude of the event at time step $ t $.
- $ \text{score}_t $: Score of the event at time step $ t $.
- $ \text{target\_date} $: The specific date for which the model is predicting a score (constant across the sequence).

---

### **Matrix for a Single Sequence**

The feature matrix for one sequence is:

$$
\mathbf{X}_b =
\begin{bmatrix}
\text{event\_date}_1 & \text{lat}_1 & \text{long}_1 & \text{score}_1 & \text{target\_date} \\
\text{event\_date}_2 & \text{lat}_2 & \text{long}_2 & \text{score}_2 & \text{target\_date} \\
\vdots & \vdots & \vdots & \vdots & \vdots \\
\text{event\_date}_{30} & \text{lat}_{30} & \text{long}_{30} & \text{score}_{30} & \text{target\_date}
\end{bmatrix}
$$

Where:
$$
\mathbf{X}_b \in \mathbb{R}^{T \times F}
$$
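
A real-valued tensor needs a numeric encoding for the two date features. The sketch below is one way to stack per-sequence DataFrames into a $B \times T \times F$ array. The helper name `sequences_to_tensor` and the days-since-epoch encoding are assumptions made for illustration, not something this notebook has settled on.

```python
import numpy as np
import pandas as pd

def sequences_to_tensor(sequences, feature_cols):
    """Stack a list of per-sequence DataFrames into a (B, T, F) float array."""
    epoch = pd.Timestamp("1970-01-01")
    matrices = []
    for seq in sequences:
        numeric = seq[feature_cols].copy()
        for col in feature_cols:
            if pd.api.types.is_datetime64_any_dtype(numeric[col]):
                # Assumed encoding: whole days since the Unix epoch
                numeric[col] = (numeric[col] - epoch).dt.days
        matrices.append(numeric.to_numpy(dtype=np.float64))
    return np.stack(matrices)

# Toy example: B = 2 sequences, T = 3 time steps, F = 5 features
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-08", "2015-01-19", "2015-01-20"]),
    "ActionGeo_Lat": [41.85, 41.85, 41.88],
    "ActionGeo_Long": [-87.65, -87.65, -87.66],
    "AvgTone": [2.0, 2.5, -3.0],
    "TargetDate": pd.to_datetime(["2015-05-07"] * 3),
})
cols = ["Date", "ActionGeo_Lat", "ActionGeo_Long", "AvgTone", "TargetDate"]
tensor = sequences_to_tensor([toy, toy], cols)
print(tensor.shape)  # (2, 3, 5)
```

Other encodings, such as days between the event and the target date, may suit the model better. The point here is only the shape.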

---


In [3]:
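def create_sequences(data, sequence_length=30):
    """Build one sequence per calendar day from the most recent prior events.

    Returns a list of DataFrames (one per target date, each with
    `sequence_length` rows) and the matching list of target dates.
    """
    X = []
    target_dates = []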

    # Convert Date column to datetime
    data['Date'] = pd.to_datetime(data['Date'])

    # Sort data by Date
    data = data.sort_values(by='Date')

    # Create a date range from the minimum date in the data to today
    date_range = pd.date_range(start=data['Date'].min(), end=pd.Timestamp.today())

    # Iterate through each date in the date range
    for target_date in date_range:
        # Get rows with event dates before the target date
        historical_data = data[data['Date'] < target_date]

        # Check if there are at least sequence_length rows
        if len(historical_data) >= sequence_length:
            # Get the last sequence_length rows
            tensor = historical_data.iloc[-sequence_length:].copy()
            # Drop the EventCode and ActionGeo_FullName columns
            tensor = tensor.drop(columns=['EventCode', 'ActionGeo_FullName'])
            # Add the target date as a new column
            tensor['TargetDate'] = target_date
            X.append(tensor)
            target_dates.append(target_date)

    return X, target_dates

# Create sequences
X, target_dates = create_sequences(data)
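
The loop in `create_sequences` re-filters the whole frame for every calendar day. As a rough sketch of an alternative, the same "last `sequence_length` events strictly before the target date" windows can be located with `np.searchsorted` on the sorted dates. The helper `window_bounds` is hypothetical and only returns row offsets; slicing the frame with them is left out.

```python
import numpy as np
import pandas as pd

def window_bounds(sorted_dates, target_dates, sequence_length=30):
    """Return start/end row offsets of the window before each usable target date."""
    # side="left" counts events with date < target, matching data['Date'] < target_date
    ends = np.searchsorted(sorted_dates, target_dates, side="left")
    # Keep only targets with at least `sequence_length` earlier events
    keep = ends >= sequence_length
    return ends[keep] - sequence_length, ends[keep], keep

# Toy example: four events, daily targets, windows of two events
events = pd.to_datetime(["2015-01-01", "2015-01-03", "2015-01-03", "2015-01-06"]).to_numpy()
targets = pd.date_range("2015-01-01", "2015-01-07").to_numpy()
starts, ends, keep = window_bounds(events, targets, sequence_length=2)
print(list(zip(starts.tolist(), ends.tolist())))
```

Each `(start, end)` pair corresponds to `historical_data.iloc[start:end]` on the date-sorted frame.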


---

Let's check that the sequences were built correctly.

---

In [16]:
from IPython.display import display

# Display the first sequence
print("First sequence:")
display(X[0].head())

#print("\nSecond sequence:")
#display(X[1].drop(columns=['TargetDate']))

#print("\nThird sequence:")
#display(X[2].drop(columns=['TargetDate']))

# Compare X[1] and X[2] row by row; uncomment the print to see the differences
seq1 = X[1].drop(columns=['TargetDate'])
seq2 = X[2].drop(columns=['TargetDate'])
for row1, row2 in zip(seq1.itertuples(index=False), seq2.itertuples(index=False)):
    differences = [val1 - val2 for val1, val2 in zip(row1, row2)]
    #print(f"Differences: {differences}")


First sequence:


Unnamed: 0,Date,ActionGeo_Lat,ActionGeo_Long,AvgTone,TargetDate
3494,2015-01-08,41.85,-87.6501,2.007772,2015-05-07
3490,2015-01-19,41.85,-87.6501,2.494577,2015-05-07
3493,2015-01-19,41.85,-87.6501,2.857143,2015-05-07
3492,2015-01-19,41.85,-87.6501,2.494577,2015-05-07
3491,2015-01-19,41.85,-87.6501,2.494577,2015-05-07





---

Looks like the features are properly engineered into the tensor!

You can uncomment the prints above to compare the sequences. It looks correct to me.

---

### **Label Definition**

The label for each sequence ($ b $) corresponds to the score for the target date:
$$
y_b = \text{score}_{\text{target\_date}}
$$

The label vector across all $ B $ sequences is:
$$
\mathbf{y} \in \mathbb{R}^B
$$
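
As a sketch of how $ \mathbf{y} $ could be assembled, the snippet below looks up one score per target date. It assumes a label table with `Date` and `Total Score` columns, as in `daily_score.csv`. The helper `build_labels` is a hypothetical name, and returning NaN for dates without a score is an illustrative choice.

```python
import pandas as pd

def build_labels(target_dates, daily_score):
    """Look up the daily score for each target date; missing dates become NaN."""
    scores = daily_score.assign(Date=pd.to_datetime(daily_score["Date"]))
    by_date = scores.set_index("Date")["Total Score"]
    return by_date.reindex(pd.DatetimeIndex(target_dates)).to_numpy()

# Toy label table and target dates (the last date has no score)
toy_scores = pd.DataFrame({
    "Date": ["2015-12-25", "2015-12-26", "2015-12-27"],
    "Total Score": [-589.428382, 0.0, 0.0],
})
toy_targets = [
    pd.Timestamp("2015-12-25"),
    pd.Timestamp("2015-12-27"),
    pd.Timestamp("2015-12-28"),
]
y = build_labels(toy_targets, toy_scores)
print(y)
```

Sequences whose label comes back as NaN would need to be dropped or filled before training.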

---

In [5]:
# Load the scored events data
daily_score = pd.read_csv('model_csv/daily_score.csv')

# Display the first few rows of the dataframe
daily_score.head()

Unnamed: 0,Date,Total Score
0,2015-12-25,-589.428382
1,2015-12-26,0.0
2,2015-12-27,0.0
3,2015-12-28,0.0
4,2015-12-29,0.0


In [17]:
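# Earliest target date among the generated sequences.
# NOTE: because this minimum comes from target_dates itself, the filter below keeps every sequence.
# To align with the labels instead, use: pd.to_datetime(daily_score['Date']).min()
min_target_date = min(target_dates)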

# Filter sequences and target_dates to only include those with target dates on or after min_target_date
filtered_sequences = [seq for seq, date in zip(X, target_dates) if date >= min_target_date]
filtered_target_dates = [date for date in target_dates if date >= min_target_date]

# Update the sequences and target_dates variables
X = filtered_sequences
target_dates = filtered_target_dates

# Display the number of sequences and target dates after filtering
print(f"Number of sequences after filtering: {len(X)}")
print(f"Number of target dates after filtering: {len(target_dates)}")

Number of sequences after filtering: 3524
Number of target dates after filtering: 3524


---

# The code below has not yet been refactored!

---

In [13]:
# Function to aggregate data by date
def aggregate_data(df):
    df_copy = df.copy()
    df_copy['SQLDATE'] = pd.to_datetime(df_copy['SQLDATE'], format='%Y%m%d')
    aggregated = df_copy.groupby('SQLDATE').agg({'EventCode': 'size', 'AvgTone': 'sum'}).reset_index()
    aggregated.columns = ['SQLDATE', 'Number of Events', 'Total AvgTone']
    return aggregated.set_index('SQLDATE')

# Aggregate and reindex data
data_aggregated = aggregate_data(data)
label_data_aggregated = aggregate_data(label_data)

# Create a date range and reindex both dataframes
all_dates = pd.date_range(start=data_aggregated.index.min(), end=pd.Timestamp.today())
reindex_data = data_aggregated.reindex(all_dates, fill_value=0).reset_index()
reindex_label_data = label_data_aggregated.reindex(all_dates, fill_value=0).reset_index()

# Rename columns
reindex_data.columns = ['SQLDATE', 'Number of Events', 'Total AvgTone']
reindex_label_data.columns = ['SQLDATE', 'Number of Events', 'Total AvgTone']

# Display the resulting DataFrames
reindex_data, reindex_label_data


KeyError: 'SQLDATE'
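
The `KeyError` is consistent with the columns shown earlier: `data_cleaned.csv` has a `Date` column, not `SQLDATE`. A minimal sketch of the same per-day aggregation keyed on `Date` might look like the following. The helper `aggregate_by_date` and the toy frame are illustrative assumptions, and the reindex stops at the last event date rather than today so the example is deterministic.

```python
import pandas as pd

def aggregate_by_date(df, date_col="Date"):
    """Count events and sum AvgTone per calendar day, indexed by date."""
    dates = pd.to_datetime(df[date_col])
    return (
        df.assign(**{date_col: dates})
          .groupby(date_col)
          .agg(**{
              "Number of Events": ("AvgTone", "size"),
              "Total AvgTone": ("AvgTone", "sum"),
          })
    )

toy = pd.DataFrame({
    "Date": ["2024-08-20", "2024-08-20", "2024-08-22"],
    "AvgTone": [-4.0, -4.0, 0.0],
})
daily = aggregate_by_date(toy)

# Fill days without events with zeros, as the original cell does via reindex
all_days = pd.date_range(daily.index.min(), daily.index.max())
full = daily.reindex(all_days, fill_value=0)
print(full)
```

The unrefactored cell also references `label_data`, which is not defined anywhere above, so that part would still need its own source.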

---

## Saving Test & Train Data

`reindex_data` is the proper formulation for the labels of our dataset. Let's quickly save the train and test sets. I want the split to capture a really bad protest, and then see whether the model can predict some of the events in 2023.

Splitting on 2021 seems like a solid choice to capture both types of events.

---

In [None]:
# Split the data into before and after 2021
train_data_labels = reindex_data[reindex_data['SQLDATE'] < '2021-01-01']
test_data_labels = reindex_data[reindex_data['SQLDATE'] >= '2021-01-01']

# Display the resulting dataframes
print("Train Data:")
display(train_data_labels)
print("\nTest Data:")
display(test_data_labels)

NameError: name 'reindex_data' is not defined

In [None]:
# Remove the "Number of Events" column
train_data_labels = train_data_labels.drop(columns=['Number of Events'])
test_data_labels = test_data_labels.drop(columns=['Number of Events'])
reindex_data = reindex_data.drop(columns=['Number of Events'])

# Save the train and test data to CSV files
train_data_labels.to_csv('train_data_labels.csv', index=False)
test_data_labels.to_csv('test_data_labels.csv', index=False)
reindex_data.to_csv('data_labels.csv', index=False)