# Let's Pull Some Data

I created this notebook so that I can isolate my SQL queries from the rest of my code.

In [17]:
# Set up the env.
#!conda init
#!conda env list ## to see the available options
#!conda activate civil_unrest ## to activate the targeted env

In [16]:
# Import the necessary modules
from google.cloud import bigquery
from matplotlib import pyplot as plt
import folium
from matplotlib import colors
from geopy.distance import geodesic


## Obtaining Events Along Commute

First, let's ensure we are pulling useful information. We want events along the commuter's path.

In [3]:
def expand_rectangle_by_mile(start_lat, start_lon, end_lat, end_lon):
    """
    Expands a rectangle by 1 mile in both latitude and longitude directions.
    Each coordinate is shifted outward by 1/2 mile.

    Parameters:
        start_lat (float): Starting latitude of the rectangle.
        start_lon (float): Starting longitude of the rectangle.
        end_lat (float): Ending latitude of the rectangle.
        end_lon (float): Ending longitude of the rectangle.

    Returns:
        dict: Expanded rectangle coordinates.
              {"start_lat": float, "start_lon": float, "end_lat": float, "end_lon": float}
    """
    # Calculate shifts of 1/2 mile in latitude and longitude
    half_mile_in_lat = geodesic(miles=0.5).destination((start_lat, start_lon), 0).latitude - start_lat
    half_mile_in_lon = geodesic(miles=0.5).destination((start_lat, start_lon), 90).longitude - start_lon

    # Determine the minimal and maximal coordinates
    min_lat = min(start_lat, end_lat)
    max_lat = max(start_lat, end_lat)
    min_lon = min(start_lon, end_lon)
    max_lon = max(start_lon, end_lon)

    # Expand the rectangle
    expanded_start_lat = min_lat - half_mile_in_lat
    expanded_start_lon = min_lon - half_mile_in_lon
    expanded_end_lat = max_lat + half_mile_in_lat
    expanded_end_lon = max_lon + half_mile_in_lon

    return {
        "start_lat": expanded_start_lat,
        "start_lon": expanded_start_lon,
        "end_lat": expanded_end_lat,
        "end_lon": expanded_end_lon
    }

# Example usage:
start_lat, start_lon = 41.91118832433419, -87.67514378155508  # nearby my home
end_lat, end_lon = 41.87300017458362, -87.62765043486581     # nearby my school
expanded_coords = expand_rectangle_by_mile(start_lat, start_lon, end_lat, end_lon)
print(expanded_coords)


{'start_lat': 41.8657555523296, 'start_lon': -87.68484261360975, 'end_lat': 41.91843294658821, 'end_lon': -87.61795160281115}
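As a rough cross-check on the half-mile shifts, here is a minimal sketch that approximates them with a spherical Earth instead of `geodesic` (which uses an ellipsoid). The helper name `approx_offsets_deg` and the mean-radius constant are my own assumptions, not part of the function above, so expect small differences from the geopy values.

```python
import math

# Mean Earth radius in miles; treating the Earth as a sphere is an assumption
EARTH_RADIUS_MILES = 3958.7613

def approx_offsets_deg(lat_deg, miles=0.5):
    """Approximate the latitude and longitude shifts (in degrees) for a distance in miles."""
    # One radian of latitude spans one Earth radius
    dlat = math.degrees(miles / EARTH_RADIUS_MILES)
    # Circles of longitude shrink by cos(latitude), so the same distance covers more degrees
    dlon = math.degrees(miles / (EARTH_RADIUS_MILES * math.cos(math.radians(lat_deg))))
    return dlat, dlon

# Latitude of the starting point used in the example above
dlat, dlon = approx_offsets_deg(41.91118832433419)
print(f"half a mile is about {dlat:.5f} deg of latitude and {dlon:.5f} deg of longitude")
```

For the starting latitude this gives roughly 0.0072 degrees of latitude and 0.0097 degrees of longitude, in line with the gap between the input coordinates and the expanded coordinates printed above.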


### Displaying Correct Region
Let's make sure we're pulling data from the right region. What does the region look like?

In [12]:
# Create a map centered around the midpoint of the expanded rectangle
mid_lat = (expanded_coords['start_lat'] + expanded_coords['end_lat']) / 2
mid_lon = (expanded_coords['start_lon'] + expanded_coords['end_lon']) / 2
m = folium.Map(location=[mid_lat, mid_lon], zoom_start=13)

# Add a rectangle to the map
folium.Rectangle(
    bounds=[
        [expanded_coords['start_lat'], expanded_coords['start_lon']],
        [expanded_coords['end_lat'], expanded_coords['end_lon']]
    ],
    color='blue',
    fill=True,
    fill_color='blue',
    fill_opacity=0.2
).add_to(m)

# Add starting point marker
folium.Marker(
    location=[start_lat, start_lon],
    popup='Starting Point',
    icon=folium.Icon(color='green')
).add_to(m)

# Add ending point marker
folium.Marker(
    location=[end_lat, end_lon],
    popup='Ending Point',
    icon=folium.Icon(color='red')
).add_to(m)

# Display the map
m

---
## Pulling Data
Let's use Google Cloud and BigQuery to pull this data from GDELT. Specifically, let's pull from the `GDELT Global Event Database (GDELT 1.0)`.

---

In [24]:
# Log in to Google Cloud
!gcloud auth application-default login

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login&state=TYtQnfRZATYbgpehZVfmGKFuI5TyK7&access_type=offline&code_challenge=8ReR9ExZfi0XJh9TGRYqgChifnkfJ-Kg7q39XfN30Ko&code_challenge_method=S256


Credentials saved to file: [/Users/warrenweissbluth/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project "civil-unrest-predictor" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resource.


In [25]:
# Initialize BigQuery client with the project ID
client = bigquery.Client(project="civil-unrest-predictor")

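# EventCode 145 and its subcodes are the CAMEO codes for violent protests and riots;
# the latitude/longitude bounds come from the expanded rectangle computed above.
query = f"""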
SELECT
    SQLDATE,
    EventCode,
    ActionGeo_FullName,
    ActionGeo_Lat,
    ActionGeo_Long,
    AvgTone
FROM
    `gdelt-bq.full.events`
WHERE
    EventCode IN ('145', '1451', '1452', '1453', '1454')
    AND ActionGeo_Lat BETWEEN {expanded_coords['start_lat']} AND {expanded_coords['end_lat']}
    AND ActionGeo_Long BETWEEN {expanded_coords['start_lon']} AND {expanded_coords['end_lon']}
    AND CAST(SQLDATE AS STRING) >= '20150101'
ORDER BY
    SQLDATE DESC
LIMIT 10000;
"""

# Execute the query
query_job = client.query(query)

# Convert results to a DataFrame
data = query_job.result().to_dataframe()

---

Later, I realized that I'd like to know the time of day at which each event occurs. That would require upgrading to `GDELT 2.0`, which I'll leave as a possible future improvement to my model.

---

In [26]:
data

Unnamed: 0,SQLDATE,EventCode,ActionGeo_FullName,ActionGeo_Lat,ActionGeo_Long,AvgTone
0,20240823,145,"Union Park, Illinois, United States",41.8839,-87.6648,-3.046968
1,20240822,145,"Union Park, Illinois, United States",41.8839,-87.6648,0.000000
2,20240820,145,"Union Park, Illinois, United States",41.8839,-87.6648,-4.319654
3,20240820,145,"Union Park, Illinois, United States",41.8839,-87.6648,-4.319654
4,20240627,145,"Buckingham Fountain, Illinois, United States",41.8756,-87.6189,-7.052186
...,...,...,...,...,...,...
142,20160320,145,"University Of Illinois At Chicago, Illinois, U...",41.8720,-87.6492,-7.417219
143,20160313,145,"University Of Illinois At Chicago, Illinois, U...",41.8720,-87.6492,-8.571429
144,20160313,145,"University Of Illinois At Chicago, Illinois, U...",41.8720,-87.6492,-8.571429
145,20160312,145,"Chicago Loop, Illinois, United States",41.8811,-87.6298,-2.366864
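In the output above, `SQLDATE` is an integer in `YYYYMMDD` form. Here is a minimal sketch of turning such values into `datetime.date` objects using only the standard library. The helper name `parse_sqldate` is hypothetical, and the sample values are copied from the rows above.

```python
from datetime import datetime

def parse_sqldate(sqldate):
    """Convert a GDELT SQLDATE value such as 20240823 into a datetime.date."""
    return datetime.strptime(str(sqldate), "%Y%m%d").date()

# A few SQLDATE values taken from the query output
sample_sqldates = [20240823, 20240627, 20160312]
parsed = [parse_sqldate(value) for value in sample_sqldates]

for value, day in zip(sample_sqldates, parsed):
    print(value, "->", day.isoformat())
```

With the DataFrame, the same conversion could presumably be applied column-wise, for example with `data['SQLDATE'].map(parse_sqldate)`.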


---

OK, now let's plot the events on the region we queried to confirm that it's the correct data.

---

In [27]:
# Add markers for each event in the data DataFrame
for index, row in data.iterrows():
    # Draw a circle in the default color at the event's location
    folium.Circle(
        location=[row['ActionGeo_Lat'], row['ActionGeo_Long']],
        radius=100,
        fill=True,
        fill_opacity=0.7,
        popup=f"Date: {row['SQLDATE']}, EventCode: {row['EventCode']}, AvgTone: {row['AvgTone']}"
    ).add_to(m)

# Display the updated map
m
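Alongside the visual check, a small non-visual one can confirm that every point lies inside the expanded rectangle. This is a sketch using a hypothetical helper, `in_bbox`; the bounds are the values printed earlier and the sample points are copied from the query output, so nothing here touches BigQuery.

```python
def in_bbox(lat, lon, box):
    """Return True if (lat, lon) lies inside the rectangle described by box."""
    return (box["start_lat"] <= lat <= box["end_lat"]
            and box["start_lon"] <= lon <= box["end_lon"])

# Expanded rectangle as printed by expand_rectangle_by_mile
box = {
    "start_lat": 41.8657555523296, "start_lon": -87.68484261360975,
    "end_lat": 41.91843294658821, "end_lon": -87.61795160281115,
}

# (name, latitude, longitude) taken from the sample rows in the output
sample_points = [
    ("Union Park", 41.8839, -87.6648),
    ("Buckingham Fountain", 41.8756, -87.6189),
    ("University Of Illinois At Chicago", 41.8720, -87.6492),
    ("Chicago Loop", 41.8811, -87.6298),
]

# Collect any point that falls outside the rectangle
outside = [name for name, lat, lon in sample_points if not in_bbox(lat, lon, box)]
print("Points outside the rectangle:", outside)
```

An empty list means the `BETWEEN` filters in the query and the rectangle drawn on the map agree for these rows.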

---

Let's try to understand how bad an event was. The "AvgTone" column might be insightful.

Here's the documentation:
> ### AvgTone. 
> (numeric) This is the average “tone” of all documents containing one or more
> mentions of this event. The score ranges from -100 (extremely negative) to +100 (extremely
> positive). Common values range between -10 and +10, with 0 indicating neutral. This can be
> used as a method of filtering the “context” of events as a subtle measure of the importance of
> an event and as a proxy for the “impact” of that event. For example, a riot event with a slightly
> negative average tone is likely to have been a minor occurrence, whereas if it had an extremely
> negative average tone, it suggests a far more serious occurrence. A riot with a positive score
> likely suggests a very minor occurrence described in the context of a more positive narrative
> (such as a report of an attack occurring in a discussion of improving conditions on the ground in
> a country and how the number of attacks per day has been greatly reduced).
[![GDELT Data Format Codebook](https://img.shields.io/badge/GDELT%20Data%20Format%20Codebook-Download-blue)](http://data.gdeltproject.org/documentation/GDELT-Data_Format_Codebook.pdf)

Here it's clear that an event with a very negative value would be considered more **impactful**. I'm going to assume this is a proxy for more dangerous. The example in the documentation supports this assumption. Thus, the lower the "AvgTone," the worse the event was.

Let's first plot the events and scale the colors so that the worse the event (the lower the "AvgTone"), the **darker the red**.


---


---

#### Error
Notice that I made sure the darker color is associated with a more negative event, as is intuitive.

That's why the following line is "flipped":

`norm = colors.Normalize(vmin=-data['AvgTone'].max(), vmax=-data['AvgTone'].min())`
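To see why the flip works, here is a minimal sketch that reproduces the linear scaling by hand, without matplotlib. It assumes `colors.Normalize` maps `vmin` to 0 and `vmax` to 1 linearly, which is its default behavior. The helper `normalize` and the sample tones, copied from the query output, are for illustration only.

```python
def normalize(value, vmin, vmax):
    """Linearly rescale value so that vmin maps to 0 and vmax maps to 1."""
    return (value - vmin) / (vmax - vmin)

# A few AvgTone values taken from the query output
sample_tones = [0.0, -3.046968, -4.319654, -7.052186, -8.571429]

# Same flip as in the notebook: negate the tones, so the bounds swap roles
vmin = -max(sample_tones)
vmax = -min(sample_tones)

# Map each tone to its position on the colormap (0 = lightest, 1 = darkest)
scaled = {tone: normalize(-tone, vmin, vmax) for tone in sample_tones}
for tone, position in scaled.items():
    print(f"AvgTone {tone:>10.6f} -> colormap position {position:.3f}")
```

The most negative tone lands at 1.0, the dark end of the `Reds` colormap, while the least negative tone lands at 0.0.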



---

In [28]:
# Define a colormap (plt.get_cmap replaces the deprecated plt.cm.get_cmap)
colormap = plt.get_cmap('Reds')

# Normalize the negated AvgTone values to [0, 1] so the most negative tone maps to the darkest red
norm = colors.Normalize(vmin=-data['AvgTone'].max(), vmax=-data['AvgTone'].min())

# Add markers for each event in the data DataFrame
for index, row in data.iterrows():
    # Get the color based on the normalized AvgTone value
    color = colors.rgb2hex(colormap(norm(-row['AvgTone'])))
    folium.Circle(
        location=[row['ActionGeo_Lat'], row['ActionGeo_Long']],
        radius=100,
        color=color,
        fill=True,
        fill_color=color,
        fill_opacity=0.7,
        popup=f"Date: {row['SQLDATE']}, EventCode: {row['EventCode']}, AvgTone: {row['AvgTone']}"
    ).add_to(m)

# Display the updated map
m



---

We can now see that the data we pulled is relevant to my path to school. The colors also show how concerning each event was.

---

---

Save the data to a `.csv` file.

---

In [29]:
data.to_csv('data.csv', index=False)