**The Geography of Human Flourishing**

Team of the __[Spatial AI-Challenge 2024](https://i-guide.io/spatial-ai-challenge-2024/)__: Stefano Iacus, Devika Jain, Andrea Nasuto.

Other co-authors related to this project: Giuseppe Porro, Marcello Carammia, Andrea Vezzulli.

**Objective**

This notebook is a step-by-step guide to generating, processing, and visualizing the Geography of Human Flourishing dataset across geospatial and temporal scales. It covers the complete workflow: data preparation, fine-tuning, classification, construction and evaluation of the Human Flourishing indicators, and visualization of results. By the end, you'll have a structured dataset and visualizations that can serve as a foundation for more advanced geospatial analyses.

Please use the **Table of Contents (TOC)** below to navigate the sections.

## Table of Contents
1. [Introduction](#Introduction)
2. [Dataset Description](#Dataset-Description)
3. [Methodology](#Methodology)
4. [Results](#Results)
5. [Interpretation](#Interpretation)
6. [Next Steps](#Next-Steps)
7. [Lessons Learned](#Lessons-Learned)
8. [Publications](#Publications)
9. [Acknowledgements](#Acknowledgements)
10. [Appendix](#Appendix)

## Introduction 

### What is Human Flourishing?

__[The Human Flourishing Program](https://hfh.fas.harvard.edu/)__ is a research initiative dedicated to studying and promoting human flourishing across diverse domains of life. It integrates interdisciplinary research from the social sciences, philosophy, psychology, and related fields to advance understanding and practical applications.

__[The Global Flourishing Study (GFS)](https://hfh.fas.harvard.edu/global-flourishing-study)__ is a five-year longitudinal study involving approximately 200,000 participants from over 20 geographically and culturally diverse countries and territories. It measures global human flourishing across six key dimensions:

- Happiness and life satisfaction
- Mental and physical health
- Meaning and purpose
- Character and virtue
- Close social relationships
- Material and financial stability

### Our Project: Geography of Human Flourishing

The Geography of Human Flourishing research project aims to analyze Harvard’s archive of 10 billion geolocated tweets (spanning from 2010 to mid-2023) through the lens of the six dimensions of human flourishing defined by the Global Flourishing Study (GFS).

Using __[fine-tuned large language models (LLMs)](https://arxiv.org/abs/2411.00890)__, the project extracts 46 indicators aligned with these six domains, generating high-resolution spatio-temporal datasets.
The 46 selected dimensions of the global human flourishing framework, and how each one is identified in a tweet, are listed below:

1. **Happiness** – the text expresses some level of happiness  
2. **Resilience** – a text expressing capability of withstanding or recovering from difficulties  
3. **Self-esteem** – the text expresses level of confidence in one's worth or abilities  
4. **Life Satisfaction** – the text expresses satisfaction with one's life as a whole  
5. **Fear of future** – the text expresses worry about one's condition in the next years  
6. **Vitality** – the text expresses feelings of strength and activity  
7. **Having energy** – the text expresses that one feels full of energy  
8. **Positive functioning** – the text expresses that one feels capable to do many things  
9. **Expressing job satisfaction** – the text expresses satisfaction with one's present job, all things considered  
10. **Expressing optimism** – the text expresses optimism about one's condition in the medium-run future  
11. **Peace with thoughts and feelings** – the text expresses a general feeling of peace with one's thoughts and feelings  
12. **Purpose in life** – the text expresses understanding of one's purpose in life. In other terms, it expresses the feeling that the things one is doing in his/her life are worthwhile  
13. **Depression** – the text expresses that one is bothered by the following problems: Little interest or pleasure in doing things; Feeling down, depressed or hopeless  
14. **Anxiety** – the text expresses that one is bothered by the following problems: Feeling nervous, anxious or on edge; Not being able to stop or control worrying  
15. **Suffering** – the text expresses the experience of any type of physical or mental suffering  
16. **Feeling pain** – the text expresses the experience of bodily pain currently or in the recent past  
17. **Expressing altruism** – the text expresses willingness to do things that bring advantages to others, even if it results in disadvantage for him/herself  
18. **Loneliness** – the text expresses feelings of loneliness  
19. **Quality of relationship** – the text expresses satisfaction about one's relationships  
20. **Belonging to society** – the text expresses a sense of belonging in one's community  
21. **Expressing gratitude** – the text expresses one's feelings of gratitude for many reasons  
22. **Expressing trust** – the text expresses feeling of trust towards people in one's community  
23. **Feeling trusted** – the text expresses that people in one's community trust one another  
24. **Balance in the various aspects of own life** – the text indicates that the various aspects of one's life are, in general, well balanced  
25. **Mastery (ability, capability)** – the text expresses one's feeling of being very capable in most things one does in life  
26. **Perceiving discrimination** – the text expresses the feeling of being discriminated against because of one's belonging to any group  
27. **Feeling loved by God** – the text expresses one's feeling of being loved or cared for by God, the main god worshipped, or the spiritual force that guides one's life  
28. **Belief in God** – the text expresses belief in one God, or more than one god, or an impersonal spiritual force  
29. **Religious criticism** – the text expresses that people in one's religious community are critical of one's person or one's lifestyle  
30. **Spiritual punishment** – the text expresses the feeling of God, a god, or a spiritual force as a punishing entity  
31. **Feeling religious comfort** – the text expresses finding strength or comfort in one's religion or spirituality  
32. **Financial/material worry** – the text expresses one's worry about being able to meet normal monthly living expenses  
33. **Life after death belief** – the text expresses one's belief in life after death  
34. **Volunteering** – the text expresses one's habit of volunteering one's time to an organization  
35. **Charitable giving/helping** – the text expresses one's habit of donating money to a charity  
36. **Seeking for forgiveness** – the text expresses propensity to forgive those who have hurt us  
37. **Feeling having a political voice** – the text expresses the feeling of having a say about what the government does  
38. **Expressing government approval** – the text expresses approval of the job performance of the national government of one's country  
39. **Having hope** – the text expresses feelings of hope about the future, despite challenges  
40. **Promoting good** – the text shows the propensity of acting to promote good in all circumstances, even in difficult and challenging situations  
41. **Expressing delayed gratification** – the text expresses ability to give up some happiness now for greater happiness later 
42. **PTSD (Post-traumatic stress disorder)** – the text expresses the tendency to be frequently bothered by the big threats to life one has witnessed or personally experienced during one's life  
43. **Describing smoking related health issues** – the text expresses the habit of smoking many cigarettes every day  
44. **Describing drinking related health issues** – the text expresses the habit of frequently drinking full drinks of any kind of alcoholic beverage  
45. **Describing health limitations** – the text indicates any health problems that prevent one from doing any of the things people that age normally can do  
46. **Expressing empathy** – the text expresses ability to share other people's feelings or experiences by imagining what it would be like to be in their own situation  

The initiative also develops interactive tools to visualize and analyze these patterns across space and time.

For the Spatial AI Challenge 2024, the project focuses on a U.S.-based subset of **2.2 billion** geolocated tweets, building interactive dashboards and scalable workflows. To further push the boundaries of spatial AI, the project explores two additional themes—**migration mood** and **perceived corruption**—in parallel with well-being.

These three domains—**well-being**, **migration mood**, and **corruption**—are often studied in pairs (e.g., migration mood vs. happiness, migration and corruption, or corruption and well-being). This project advances the field by examining the dynamic interplay among all three, offering new insights into their complex interrelationships across both space and time.

## Dataset Description 

### Geotweet Archive V2.0

The Harvard Center for Geographic Analysis (CGA) maintains the GeoTweet Archive, a global dataset of tweets spanning across time, geography, and language. This archive covers the period from 2010 to July 12, 2023, when Twitter transitioned its API access from free to a paid model.

The archive contains approximately 10 billion multilingual tweets from around the world (see map below) and is hosted on Harvard University’s High Performance Computing (HPC) cluster.

For more details about the archive and how to access it, please visit __[Geotweets Archive v2.0](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3NCMB6)__.

![Geotweets.png](attachment:436564cb-50cb-4798-b8b9-474169a63fe5.png)

**Sample data preview for Geotweets Archive v2.0 Dataset**

![data_preview.png](attachment:52ac157e-6d19-41d8-a4e8-4b2c0256d4b0.png)


**US Tweets vs Global Tweets Statistics by Year**

![newplot.png](attachment:78bf35d8-51ac-472c-a395-78e622b54e15.png)

![download.png](attachment:51348b64-5d7a-4d75-8bd8-70303662e1d5.png)

## Methodology

Before getting to the final statistics, there are a few steps. The Twitter archive input data look like this:

```

> library(arrow)
> x <- read_parquet("2019-01-01.parquet")
> head(x)
           message_id                date
1 1079890126891819008 2019-01-01 00:00:44
2 1079890769375301632 2019-01-01 00:03:17
3 1079890829471346690 2019-01-01 00:03:31
4 1079891102742839298 2019-01-01 00:04:36
5 1079891393827491840 2019-01-01 00:05:46
6 1079891925895991297 2019-01-01 00:07:53
                                                                text tags
1                                      xxxxxxxxxxxxxxxxxxxxxxxxxxxxx hudl
2                                      xxxxxxxxxxxxxxxxxxxxxxxxxxxxx <NA>
3                                      xxxxxxxxxxxxxxxxxxxxxxxxxxxxx <NA>
4                                      xxxxxxxxxxxxxxxxxxxxxxxxxxxxx <NA>
5                                      xxxxxxxxxxxxxxxxxxxxxxxxxxxxx <NA>
6                                      xxxxxxxxxxxxxxxxxxxxxxxxxxxxx <NA>
  tweet_lang
1        und
2         en
3         en
4         en
5         en
6         en
                                                                                source
1 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
2 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
3                 <a href="https://algotraffic.com" rel="nofollow">ALGOTweetEngine</a>
4   <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
5   <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
6   <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
            place retweets tweet_favorites photo_url    quoted_status_id
1 Phenix City, AL        0               0      <NA>                  -1
2    Alabama, USA        0               0      <NA>                  -1
3  Huntsville, AL        0               0      <NA>                  -1
4    Alabama, USA        0               0      <NA>                  -1
5    Alabama, USA        0               0      <NA>                  -1
6 Gulf Shores, AL        0               0      <NA> 1079418894538940418
     user_id      user_name       user_location followers friends
1 yyyyyyyyyy       zzzzzzzz                <NA>       106     365
2 yyyyyyyyyy       zzzzzzzz    Spanish Fort, AL       962     785
3 yyyyyyyyyy       zzzzzzzz             Alabama      2320      26
4 yyyyyyyyyy       zzzzzzzz  Camary’s World🎢⛲️       988     908
5 yyyyyyyyyy       zzzzzzzz Huntsville, Alabama       611     507
6 yyyyyyyyyy       zzzzzzzz             Alabama     16127    1645
  user_favorites status user_lang latitude longitude data_source   GPS
1             16    114        NA 32.43837 -85.02636         {1} False
2          44139  14933        NA 32.57623 -86.68074         {1} False
3              0  84303        NA 34.74160 -86.66993         {1}  True
4           2269  47510        NA 32.57623 -86.68074         {1} False
5          32896  38162        NA 32.57623 -86.68074         {1} False
6          88874 113087        NA 30.28683 -87.70657         {1} False
        spatialerror
1  6532.208939719264
2 240277.71220960523
3               10.0
4 240277.71220960523
5 240277.71220960523
6  5488.912574137873
                                                                            geometry
1 01, 01, 00, 00, 00, 94, 30, d3, f6, af, 41, 55, c0, b5, bd, dd, 92, 1c, 38, 40, 40
2 01, 01, 00, 00, 00, 4b, ea, 04, 34, 91, ab, 55, c0, 3d, 09, 6c, ce, c1, 49, 40, 40
3 01, 01, 00, 00, 00, 2e, ab, b0, 19, e0, aa, 55, c0, 57, 5b, b1, bf, ec, 5e, 41, 40
4 01, 01, 00, 00, 00, 4b, ea, 04, 34, 91, ab, 55, c0, 3d, 09, 6c, ce, c1, 49, 40, 40
5 01, 01, 00, 00, 00, 4b, ea, 04, 34, 91, ab, 55, c0, 3d, 09, 6c, ce, c1, 49, 40, 40
6 01, 01, 00, 00, 00, d6, be, 80, 5e, 38, ed, 55, c0, eb, 17, ec, 86, 6d, 49, 3e, 40
  index_right STATEFP20 COUNTYFP20 TRACTCE20 BLOCKCE20         GEOID20
1       48383        01        113    030601      2000 011130306012000
2       44395        01        001    021000      2013 010010210002013
3       41282        01        089    001403      2000 010890014032000
4       44395        01        001    021000      2013 010010210002013
5       44395        01        001    021000      2013 010010210002013
6       10504        01        003    011412      1098 010030114121098
      NAME20 MTFCC20 UR20 UACE20 UATYPE20 FUNCSTAT20 ALAND20 AWATER20
1 Block 2000   G5040    R     NA       NA          S  286822        0
2 Block 2013   G5040    R     NA       NA          S 1089449        0
3 Block 2000   G5040    R     NA       NA          S   41148        0
4 Block 2013   G5040    R     NA       NA          S 1089449        0
5 Block 2013   G5040    R     NA       NA          S 1089449        0
6 Block 1098   G5040    R     NA       NA          S 1642624    17691
   INTPTLAT20   INTPTLON20 __index_level_0__
1 +32.4431725 -085.0218909               104
2 +32.5803510 -086.6776122               329
3 +34.7349681 -086.6701305               361
4 +32.5803510 -086.6776122               492
5 +32.5803510 -086.6776122               582
6 +30.2920963 -087.7110014               745
```
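The `geometry` column in the preview is printed as raw bytes. Its leading bytes (`01 01 00 00 00`) are consistent with a little-endian Well-Known Binary (WKB) point, so the coordinates can be recovered with the Python standard library alone. The following is a minimal sketch under that assumption; the helper name `decode_wkb_point` is hypothetical and not part of the project code.

```python
import struct

def decode_wkb_point(hex_bytes):
    """Decode a 21-byte WKB Point into (x, y), i.e. (longitude, latitude)."""
    raw = bytes.fromhex(hex_bytes.replace(",", "").replace(" ", ""))
    byte_order = raw[0]                      # 1 = little-endian, 0 = big-endian
    endian = "<" if byte_order == 1 else ">"
    (geom_type,) = struct.unpack(endian + "I", raw[1:5])
    if geom_type != 1:                       # 1 is the WKB code for Point
        raise ValueError(f"not a WKB Point (type {geom_type})")
    x, y = struct.unpack(endian + "dd", raw[5:21])
    return x, y

# Geometry bytes of the first row in the preview above
row1 = "01, 01, 00, 00, 00, 94, 30, d3, f6, af, 41, 55, c0, b5, bd, dd, 92, 1c, 38, 40, 40"
lon, lat = decode_wkb_point(row1)
print(round(lon, 3), round(lat, 3))
```

The decoded pair should be close to the `longitude` and `latitude` columns of the same row.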

There are three models running in parallel that classify the same tweet and produce numeric scores:
* human flourishing: e.g., happiness: low (-1), medium (0.5) and high (1); NA indicates that none of the 46 dimensions of human flourishing has been found;
* migration mood: pro-migration (+1), anti-migration (-1), neutral (0), not about migration (NA);
* perception of corruption: about corruption (1) or not (0).

For each dimension, the calculation is done by aggregating and summing by regional area (census area, county, state), and period (month, year). The calculation is essentially summing up the values and normalizing by the total number of relevant/in topic tweets.

Therefore, all values vary in (-1,+1) with the exception of ```corruption```, which is always a number in [0,1].
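The aggregation described above can be sketched with pandas. This is an illustration under stated assumptions, not the project's pipeline: the column names `county`, `year`, and `happiness` and the toy scores are made up, and a missing value marks a tweet in which the dimension was not detected.

```python
import pandas as pd

# Toy classified tweets: one row per tweet, NaN = dimension not detected.
tweets = pd.DataFrame({
    "county":    ["01001", "01001", "01001", "01003", "01003"],
    "year":      [2020, 2020, 2020, 2020, 2020],
    "happiness": [1.0, -1.0, None, 0.5, 1.0],
})

grouped = tweets.groupby(["county", "year"])["happiness"]
indicator = (
    pd.DataFrame({
        "score_sum": grouped.sum(),       # NaN values are skipped by default
        "n_in_topic": grouped.count(),    # counts only non-missing scores
    })
    .assign(stat=lambda d: d["score_sum"] / d["n_in_topic"])
    .reset_index()
)
print(indicator)
```

Because every individual score lies between -1 and 1, dividing the sum by the number of in-topic tweets keeps the aggregated value in the same range.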

The [FlourishingMap Explorer](https://github.com/siacus/flourishingmap) further applies two transforms to improve contrast, since most values are close to zero: ```log_indicator = log(2+indicator)``` and ```log_corruption = log(1+corruption)```. The statistics are then normalized again to [-1,+1] (after centering at ```log(2)``` for all indicators except "corruption").
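A minimal sketch of these two transforms follows. The text does not say how the final normalization to [-1,+1] is done, so dividing by the largest absolute value is an assumption made here for illustration; the function names and toy values are also hypothetical.

```python
import math

def rescale(values):
    """Divide by the largest absolute value so results fall in [-1, 1]."""
    peak = max(abs(v) for v in values)
    return [v / peak for v in values] if peak > 0 else list(values)

def transform_indicator(values):
    # log(2 + x) is defined for x > -2; subtracting log(2) maps x = 0 to 0.
    return rescale([math.log(2 + v) - math.log(2) for v in values])

def transform_corruption(values):
    # log(1 + x) already maps x = 0 to 0, so no centering is needed.
    return rescale([math.log(1 + v) for v in values])

happiness = [-0.02, 0.0, 0.01, 0.05]      # toy values near zero
corruption = [0.0, 0.001, 0.004]
print(transform_indicator(happiness))
print(transform_corruption(corruption))
```

Centering at `log(2)` keeps a neutral indicator at zero after the transform, so the sign of the mapped value still separates negative from positive sentiment.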

## Results

This notebook presents partial results from the large language model (LLM) analysis of a **2.2 billion tweets** subset from the United States, extracted from Harvard's GeoTweets Archive v2.0. Each raw tweet was enriched with geospatial information linked to **8,180,866 Census Blocks**. The analysis was conducted at the Census Block level and aggregated annually at the County and State levels.

Computation for the full 13-year period across the entire U.S. is ongoing. Updated datasets, including monthly aggregations by county and state, are published on the main __[siacus/Flourishing](https://huggingface.co/datasets/siacus/flourishing)__ data repository on Hugging Face, which is updated as new data become available.

### Interactive Map

You can explore partial results (year >= 2012) for some of the **46** flourishing dimensions on our __[Interactive Map](https://askdataverse.shinyapps.io/FlourishingMap/)__. See demo video below.

In [13]:
from IPython.display import IFrame

IFrame("https://www.youtube.com/embed/tGjo_j8dhQo", width=1000, height=500)



### Data Visualization

Let us explore state and county level **Happiness** for **2020** using the code below.

In [None]:
# Run this block to import the required libraries.

import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
from datasets import load_dataset

In [None]:
# Load data from Hugging Face: https://huggingface.co/datasets/siacus/flourishing

state_ds = load_dataset("siacus/flourishing", data_files="flourishingStateYear.parquet") # Load data for states for each year
county_ds = load_dataset("siacus/flourishing", data_files="flourishingCountyYear.parquet") # Load data for county for each year

state_df = state_ds["train"].to_pandas()  
county_df = county_ds["train"].to_pandas()


In [None]:
# Filter for ``Happiness`` for a specific year

var = "happiness" # Select ``Happiness`` from the Flourishing data
yr = 2020         # Select 2020 as the year

state_plot = state_df[(state_df["variable"] == var) & (state_df["year"] == yr)].copy() #State-Level data
county_plot = county_df[(county_df["variable"] == var) & (county_df["year"] == yr)].copy() # County-Level data

state_plot["FIPS"] = state_plot["FIPS"].apply(lambda x: f"{int(x):02d}") #State Plot
county_plot["StateCounty"] = county_plot["StateCounty"].apply(lambda x: f"{int(x):05d}") #County Plot


In [None]:
# Load the US states and County shapefiles

states = gpd.read_file("https://huggingface.co/datasets/siacus/flourishing/resolve/main/cb_2021_us_state_20m.zip") # Load US states shapefile
counties = gpd.read_file("https://huggingface.co/datasets/siacus/flourishing/resolve/main/cb_2021_us_county_20m.zip") # Load US counties shape file

states = states[~states["STUSPS"].isin(["AK", "HI", "PR"])].copy() # Remove AK, HI and PR; copy so new columns can be added safely
counties = counties[~counties["STATEFP"].isin(["02", "15", "72"])].copy() # Remove the counties of AK, HI and PR


In [None]:
# Merge the States and County Shapefile with Flourishing Data
states["FIPS"] = states["STATEFP"]
state_map = states.merge(state_plot, on="FIPS", how="left") # Merge the State shapefile with State-level Flourishing data

counties["StateCounty"] = counties["STATEFP"] + counties["COUNTYFP"] 
county_map = counties.merge(county_plot, on="StateCounty", how="left") # Merge the County shapefile with County-level Flourishing data
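
Because both merges above use `how="left"`, any state or county without a matching key ends up with a missing `stat` and is drawn in grey on the map. A small sketch of how join coverage could be checked with the `indicator` option of `merge` follows. It uses plain toy `DataFrame`s instead of the shapefiles so that it runs offline; the codes and values are made up.

```python
import pandas as pd

shapes = pd.DataFrame({"StateCounty": ["01001", "01003", "01005"]})
stats = pd.DataFrame({
    "StateCounty": [1001, 1003],          # keys stored as integers
    "stat": [0.12, -0.03],
})

# Same zero-padding as in the notebook, so keys match as strings.
stats["StateCounty"] = stats["StateCounty"].apply(lambda x: f"{int(x):05d}")

merged = shapes.merge(stats, on="StateCounty", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "StateCounty"].tolist()
print(f"{len(unmatched)} of {len(merged)} counties have no data: {unmatched}")
```

A long list of unmatched keys usually points to a formatting mismatch (for example, lost leading zeros) rather than to genuinely missing data.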


-----------------------------

In [None]:
# Plot maps for State-Level and County-level ``Happiness`` for the selected year (``2020``)

fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# State Level Map

state_map.plot(
    column="stat",
    cmap="plasma",
    linewidth=0.1,
    ax=axes[0],
    edgecolor="white",
    missing_kwds={"color": "lightgrey"}
)
axes[0].set_title(f"Variable '{var}' by State - {yr}")
axes[0].axis("off")

# County Level Map

county_map.plot(
    column="stat",
    cmap="plasma",
    linewidth=0,
    ax=axes[1],
    edgecolor="white",
    missing_kwds={"color": "lightgrey"}
)
axes[1].set_title(f"Variable '{var}' by County - {yr}")
axes[1].axis("off")

plt.tight_layout()
plt.show()

![Happiness_2020.png](attachment:0a0384d2-7e83-425a-afbb-3e82c60fc05b.png)

## Interpretation

The following section interprets some of our key variables over space and time. We also compare our preliminary findings with other official sources such as __[Transparency International](https://www.transparency.org/en/cpi/2024)__, __[CDC Places](https://www.cdc.gov/places/index.html)__ and the __[World Happiness Report](https://www.gallup.com/analytics/349487/world-happiness-report.aspx)__.

**Note:** These are our preliminary findings, and we are currently conducting a more in-depth analysis to better understand the observed trends.

### Corruption

------------------------------------------------------------------------------------------------------------------------------------
**Our Analysis**

Our analysis shows an increased perception of corruption on social media from **2017 to 2020**. For instance, several states show a darkening of shades in **2020**, signaling a broader rise in corruption discourse. Notably, Wisconsin, Maine, Michigan, and New Hampshire become much darker, indicating a significant spike in corruption-related discussions. Midwestern and Northeastern states see the greatest increases overall. States like Vermont, Montana, and North Dakota remain among the darkest, continuing the trend from 2017. Some states, such as Utah and Nevada, still show low levels of corruption discourse, consistent with 2017. Overall, our analysis points to an increased perception of corruption, erosion of trust in institutions, and lack of accountability from 2017 to 2020. The maps below show the rise in the perception of corruption from 2017 to 2020.

![2017_corruption.png](attachment:bde381ab-5c13-487d-b1c0-21d1963a2324.png) 

![corruption_2020.png](attachment:afb6d685-6492-46a7-8a3d-4955a3f3858f.png)


**Transparency International Analysis**

Our findings correlate well with the **Corruption Perceptions Index (CPI)** published by __[Transparency International](https://www.transparency.org/en/cpi/2024)__. CPI results are given on a scale from **0 (highly corrupt)** to **100 (very clean)**. As shown below, the U.S. CPI score continued a downward trend, falling from **75** in 2017 to **67** in 2020. Transparency International attributed the drop in the U.S. score to:

- Weakened checks and balances and lack of oversight, particularly around government contracts and COVID-19 stimulus spending.
- Political polarization and attacks on democratic norms, especially during the 2020 election season.
- Concerns about influence of money in politics and lobbying practices.
- Limited enforcement of anti-corruption laws, especially at higher levels of government.

![corruption.png](attachment:675ef0ee-0d2d-401e-b595-e4e6ed85d02b.png)
![CI.png](attachment:7edff246-dc07-4fa0-8086-c0ccebfbe571.png)


### Migration Sentiment

**Our Analysis**

Our analysis reveals widespread positive sentiment toward migration across the U.S. in 2020, even amid a politically polarized environment. Urbanized and coastal states showed the strongest pro-migrant attitudes, but notably, many traditionally conservative regions also exhibited a softening of views. Several Southern and Midwestern states reflected moderate support, aligning with Gallup’s 2020 finding that national support for immigration reached a record high during this period.

![migmood_2020.png](attachment:92275582-48bb-47b1-92f0-f6043d8b182b.png)

**Gallup Survey** 
  
A Gallup survey conducted in June 2020 found that **34%** of Americans supported increasing immigration, the highest share recorded by Gallup since 1965. Only 28% supported reducing immigration, and 36% favored maintaining current levels. This marked the first time in Gallup's trend that the percentage wanting increased immigration exceeded the percentage wanting decreased immigration. The shift was likely driven by changing demographics and increased awareness of immigrants' roles (e.g., healthcare, food supply during COVID-19).

![Immigration.png](attachment:bb5561b4-d6fe-4b9b-990b-a3321eab0cd8.png)

Support for increased immigration was at historic highs this year among both Democrats and political independents. 

![rjh6004qme-2bl6kbi9unw.png](attachment:2cf52aa2-a590-4034-89ac-ceceb27389e0.png)

Nearly eight in ten (77%) Americans reported that immigration is a good thing for their country.

![j2tgxmydpu-sutptbkkmeg.png](attachment:11a772b4-5e67-482c-8f34-a2e3bff7050e.png)


### Happiness

**Our Analysis**

**2020**

Our analysis of expression of happiness in 2020 reveals a surprising pattern of **overall strong positive sentiment** across much of the United States, despite the ongoing COVID-19 pandemic. Notably, the Midwest and Plains states—including Iowa, Nebraska, Kansas, Minnesota, and South Dakota—displayed deep purple hues, indicating higher-than-average happiness. In contrast, Oregon registered the lowest happiness score, marked by a deep red on the map. Other states with relatively low happiness levels included Tennessee, Georgia, Florida, Louisiana, and parts of the Northeast (notably New York and Massachusetts), suggesting greater emotional distress. Meanwhile, Western and Southern states such as California, Texas, and Arizona exhibited more neutral sentiment, reflected in light beige and pink tones—indicating values near zero on the happiness-sadness scale.

The geographic pattern reveals that the happier states were often rural or midwestern, potentially pointing to: 

- Stronger community ties and mutual aid networks during the pandemic
- Lower cost of living
- Reduced urban stress compared to denser metropolitan areas

On the other hand, states like Oregon and New York—with higher urban density—may have been more heavily impacted by pandemic-related stress, including lockdowns, isolation, and economic disruption.

Interestingly, the overall color palette of the map in 2020 is noticeably cooler, suggesting a general upswing in happiness across many states, despite the adversity of the time.

This trend may be partly explained by:

- Heightened social cohesion and support systems that became more visible during the crisis

- Government stimulus measures and a shared sense of national purpose, which may have temporarily buffered emotional distress in various parts of the country

Overall, while 2020 was a year of historic challenge, it also revealed that collective resilience and positive sentiment remained remarkably strong in large parts of the United States.

![happiness.png](attachment:b7a874cf-230e-4bf7-8923-874e5a695b9c.png)

**2017**

We also noticed a drop in the expression of happiness from **2017** to **2020**, as shown in the map below: 2017 shows more expression of happiness across the United States than 2020. Florida appears to have experienced a happiness decline by 2020, shifting from deep blue to neutral. Oregon remains consistently low across both years. This trend aligns well with the World Happiness Report, where the U.S. global ranking fell from **14th** in 2017 to **18th** in 2020.

![2017_happiness.png](attachment:e892a090-4323-46f5-a2fb-780f2fc5830d.png)

**2023**

We also observed an increase in the expression of happiness from 2020 to 2023, with noticeably more positive sentiment across the United States in 2023. This trend aligns well with the World Happiness Report, which shows an improvement in the U.S. global ranking—from **18th** in 2020 to **15th** in 2023. The overall expression of happiness in the country improved in 2023 compared to 2020, driven by strong economic indicators and access to social support systems.

![happiness_2023.png](attachment:d9430178-fbc2-4e8c-949b-4152fed12d11.png)


**World Happiness Report**

The U.S. Happiness Index is typically derived from the World Happiness Report, which ranks countries based on survey responses to the Gallup World Poll. The scores are based on answers to the Cantril ladder question, where respondents rate their lives from 0 (worst possible) to 10 (best possible). 

In 2020, the U.S. Happiness Index—reflected in both national averages and state-level data—showed a modest yet noticeable improvement over 2019, despite the disruptions caused by the COVID-19 pandemic. Globally, the U.S. rose to **18th** place in 2020, although this still marked a continuation of a longer-term downward trend from earlier positions such as **14th** in 2017. The ranking improved again in 2023, reaching **15th**. These trends align with our findings: while expressions of happiness increased in 2020 compared to 2019, they remained lower than in 2017 and were surpassed again by 2023 levels.

### Depression

**Our Analysis**

Our findings for *2023* show high levels of expressed depression in North Carolina, Tennessee, Montana, Washington, and parts of the Midwest and Pacific Northwest. New Mexico is highlighted in bright red, indicating the lowest depression levels (i.e., an emotionally well population). Many Western and Midwestern states (e.g., California, Colorado, Illinois, Ohio) show neutral to mildly elevated depression levels. The Northeast appears fairly balanced overall, with mostly pale purple and beige tones.

![Depression_2023.png](attachment:63b95044-989a-4c0e-bda9-445525c0b549.png)

**CDC PLACES**

Our findings correlate well with __[CDC Places](https://experience.arcgis.com/experience/22c7182a162d45788dd52a2362f8ed65)__ findings on **Frequent Mental Distress**. Strong alignment exists in places like North Carolina, Tennessee, and parts of the Midwest and Pacific Northwest, where both our map and the CDC map show elevated emotional distress. Montana and Wyoming generally align, whereas the Southeast (MS, AL, LA) appears more severe in the CDC data.

Some discrepancy is seen in New Mexico and Appalachia, where our map reports high well-being but the CDC data show substantial distress, possibly reflecting differences in measurement sources (discourse vs. direct health survey).

![Distress.png](attachment:6e3dc14a-a16c-4961-a533-79d328a4c871.png)

## Next Steps

As next steps, we would like to do the following:

- Finalize the computation of all 46 human flourishing variables at the U.S. state and county levels.

- Scale up the analysis of the 46 variables to a global level, expanding geographic coverage beyond the United States.

- Further correlate our findings with __[AlphaGeo's Global Climate Risk, Resilience and Vulnerability Index](https://docs.alphageo.ai/products/climate-risk-and-resilience-index/the-alphageo-advantage-climate-risk-and-resilience-index)__ which is a unique two-in-one scoring suite of (1) Physical Risk and (2) Resilience-adjusted Risk. Together, these deliver risk and resilience assessments at scale. Explore the dataset on the __[Washington Post column](https://www.washingtonpost.com/climate-environment/interactive/2024/climate-risk-resilience-factors-us-cities/)__.

- Validate health-related variables by correlating them with official public health data from __[CDC Places](https://www.cdc.gov/places/index.html)__.
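
One simple way to carry out such a validation is a rank correlation between a state-level indicator and the external measure. The sketch below uses made-up numbers and hypothetical column names purely to show the mechanics; a real analysis would take `stat` from the flourishing files and the survey-based measure from CDC Places.

```python
import pandas as pd

# Toy state-level values; neither column contains real measurements.
df = pd.DataFrame({
    "state": ["A", "B", "C", "D", "E"],
    "depression_stat": [0.10, 0.02, 0.15, -0.05, 0.12],   # tweet-based indicator
    "cdc_distress_pct": [17.5, 13.0, 15.8, 16.2, 18.1],   # survey-based measure
})

# Spearman correlation = Pearson correlation of the ranks.
ranks = df[["depression_stat", "cdc_distress_pct"]].rank()
rho = ranks["depression_stat"].corr(ranks["cdc_distress_pct"])
print(f"Spearman rho = {rho:.2f}")
```

A rank-based measure is used here because the two sources are on different scales, so only the ordering of states is compared.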

## Lessons Learned

Although the three clusters use SLURM, OOD, and the same GPUs, it still takes some time to figure out the correct modules to load on each one. SLURM on Anvil does not support --requeue, which would have been useful in our case. After iterating on three clusters, the scripts ended up more robust than their original versions.

## Publications


1. Carpi, T., Hino, A., Iacus, S.M., Porro, G. (2022) The Impact of COVID-19 on Subjective Well-Being: Evidence from Twitter Data, Journal of Data Science 21(4), 761-780, __[DOI](https://jds-online.org/journal/JDS/article/1297/info)__.
2. Iacus, S. M., & Porro, G. (Eds.). (2023) Subjective well-being and social media. Routledge. ISBN: 9781032043166 __[LINK](https://www.routledge.com/Subjective-Well-Being-and-Social-Media/Iacus-Porro/p/book/9781032043166?srsltid=AfmBOopDDrHgFJs8bT0jeAnPVwZZfGRq9aUFL6z2fZmQxmMZEqIp9LU_)__.
3. Chai, Y., Kakkar, D., Palacios, J. et al. (2023) Twitter Sentiment Geographical Index Dataset, Sci Data 10, 684, __[DOI](https://www.nature.com/articles/s41597-023-02572-7)__.
4. Carammia, M., Iacus, S.M., Porro, G. (2024) Rethinking Scale: The Efficacy of Fine-Tuned Open-Source LLMs in Large-Scale Reproducible Social Science Research, ArXiv, __[DOI](https://arxiv.org/abs/2411.00890)__.

## Acknowledgements

We extend our sincere thanks to the I-GUIDE team for the opportunity to participate in this challenge. We are especially grateful to Diana Sackton, Shaowen Wang, Anand Padmanabhan, Rajesh Kalyanam, Noah S. Oller Smith, and Nattapon Jaroenchai for their ongoing support and guidance throughout the project. We also acknowledge the Harvard FASRC team, with special thanks to Paul Edmon, for providing the additional computing resources that were essential to this work. We would like to thank Xiaokang Fu from the Harvard Center for Geographic Analysis (CGA) for his assistance in enriching U.S. tweets with Census geography. Finally, we would like to thank Parag Khanna from AlphaGeo for generously sharing the county-level climate and resilience index data for the United States.

## Appendix

### Scripts for Fine-Tuning, Classification and Statistical Analysis 

These directories contain scripts for:
* __[finetuning](https://github.com/siacus/flourishing-i-challenge/tree/main/scripts/finetuning)__ of LLMs
* __[classification](https://github.com/siacus/flourishing-i-challenge/tree/main/scripts/classification)__ of raw tweets
* __[construction of statistical indicators](https://github.com/siacus/flourishing-i-challenge/tree/main/scripts/indicators)__

These scripts should work on the Anvil, Delta-AI and FASRC clusters, but read the notes below before trying to run them. What follows is a simplified set of instructions for replicability, plus some notes that we found useful. Some tweaking is inevitable, such as changing the account, allocation, SLURM partition names and folders. These scripts assume you have an account on **ACCESS Anvil** or **Harvard FASRC**.

### Account Access

#### ACCESS accounts: Anvil

* create an ACCESS account [here](https://operations.access-ci.org/identity/new-user) 
* log in via SSH: follow the instructions [here](https://www.rcac.purdue.edu/knowledge/anvil/access/login). Essentially, first log in to the web [Open OnDemand interface](https://ondemand.anvil.rcac.purdue.edu) using your ACCESS username and password, then upload your public key by launching a shell from the OOD console.
* configuring VS Code: we found this [link](https://github.com/KempnerInstitute/kempner-computing-handbook/blob/main/kempner_computing_handbook/development_and_runtime_envs/using_vscode_for_remote_development.md) useful
* general instructions on how to run jobs on Anvil [here](https://www.rcac.purdue.edu/knowledge/anvil/run), and specifically [GPU jobs](https://www.rcac.purdue.edu/knowledge/anvil/run/examples/slurm)
* home directory ```/home/x-siacus``` (adjust)
* project directory:  ```$PROJECT``` or ```/anvil/projects/x-soc250007``` (adjust)
* scratch folder: ```/anvil/scratch/x-siacus/``` (adjust)

#### ACCESS accounts: Delta-AI

* create an ACCESS account [here](https://operations.access-ci.org/identity/new-user) 
* login via SSH: follow the instructions [here](https://docs.ncsa.illinois.edu/systems/deltaai/en/latest/user-guide/login.html#ssh-examples).
* [Open OnDemand interface](https://gh-ondemand.delta.ncsa.illinois.edu/) using your NCSA username and password.
* configuring VSCODE: read this [page](https://docs.ncsa.illinois.edu/systems/deltaai/en/latest/user-guide/vscode/remote-ssh.html)
* general instructions on how to run jobs on Delta-AI [here](https://docs.ncsa.illinois.edu/systems/deltaai/en/latest/user-guide/running-jobs.html#partitions-queues)
* home directory ```/u/siacus``` (adjust)
* project directory:  ```$PROJECT``` or ```/projects/befu/siacus/``` (adjust)

#### Harvard FASRC accounts

* log in via SSH: follow the instructions [here](https://docs.rc.fas.harvard.edu/kb/ssh-to-a-compute-node/). Essentially, first log in to the web [Open OnDemand interface](https://rcood.rc.fas.harvard.edu/pun/sys/dashboard/) using your FASRC username and password, then upload your public key by launching a shell from the OOD console.
* configuring VS Code: we found this [link](https://github.com/KempnerInstitute/kempner-computing-handbook/blob/main/kempner_computing_handbook/development_and_runtime_envs/using_vscode_for_remote_development.md) useful
* general instructions on how to run jobs on FASRC [here](https://docs.rc.fas.harvard.edu/kb/running-jobs/), and specifically [GPU jobs](https://docs.rc.fas.harvard.edu/wp-content/uploads/2013/10/GPU_Computing_9_26.pdf)
* home directory ```/n/home11/siacus``` (adjust)
* scratch folder: ```/n/netscratch/siacus_lab``` (adjust)
  
### Setting up a Conda Environment

All scripts run in a conda environment. **DO NOT USE MAMBA!**

**How to build the conda environment that can be used for both fine-tuning and inference**

#### Cluster modules for FASRC
```
module load nvhpc/23.7-fasrc01
module load cuda/12.2.0-fasrc01 
module load gcc/12.2.0-fasrc01
```
#### Cluster modules for Anvil
```
module purge
module load anaconda
```
#### Cluster modules for Delta-AI
```
module purge
module load nvhpc-openmpi3/24.3
module load gcc/11.4.0
module load nvhpc-hpcx-cuda12
```
**Building the actual environment**
```
conda create -n cuda python=3.10
conda activate cuda
pip install accelerate peft bitsandbytes transformers trl
pip install huggingface-hub
huggingface-cli login   # paste a Hugging Face token with read/write access
pip install wandb       # wandb asks for the same type of authentication on first use
pip install psutil
pip install pandas tqdm datasets   # should already be installed as dependencies
```
**LLAMA-CPP-PYTHON installation for A100, H100 and H200**
```
conda activate cuda
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122 \
  --force-reinstall --verbose
```
This is crucial if you use clusters with different versions of NVIDIA GPUs.

### Testing the Environment

After starting an interactive session (see the examples below), always load the modules and activate the conda environment.

```
conda activate cuda
```
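
Once the environment is active, a quick sanity check is to confirm that the packages installed above can be located by the interpreter. The snippet below is a minimal sketch using only the standard library; the `REQUIRED` list and the `missing_packages` helper are illustrative names, and the check only verifies that each package is installed, not that CUDA or the GPU works.

```python
from importlib.util import find_spec

# Import names for the packages installed above. Some differ from the pip
# names: huggingface-hub -> huggingface_hub, llama-cpp-python -> llama_cpp.
REQUIRED = [
    "accelerate", "peft", "bitsandbytes", "transformers", "trl",
    "huggingface_hub", "wandb", "psutil", "pandas", "tqdm", "datasets",
    "llama_cpp",
]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

Because `find_spec` locates a top-level package without importing it, the check runs quickly even on a login node without a GPU.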
#### On Anvil (adjust the allocation)

generic: 

```sinteractive -p shared  -N 1 -n 4 -A soc250007 -t 2:0:0```

for the gpu:

```sinteractive -p gpu -N 1 -n 4 -A soc250007-gpu --gres=gpu:1 -t 2:0:0```

#### On Delta-AI (adjust the allocation)

generic: 

```sinteractive -p shared  -N 1 -n 4 -A soc250007 -t 2:0:0```

for the gpu:

```
salloc --mem=16g --nodes=1 --ntasks-per-node=1 --cpus-per-task=2 \
  --partition=ghx4 \
  --account=befu-dtai-gh --time=00:30:00 --gpus-per-node=1
```

#### On Harvard FASRC

generic: 

```salloc -p test  --ntasks=1 --cpus-per-task=4 --mem=32G -t 120```

for the gpu: 

```salloc -p gpu_test --gres=gpu:1 --mem=40G -N 1 -t 120```


### Useful SLURM commands

* to know which partitions are available: ```showpartitions```
* to list your jobs: ```squeue | grep siacus```   # (adjust username)
* to kill one of your jobs: ```scancel job_num```
* to kill all your jobs: ```scancel -u $USER```
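
When many jobs are in flight, a per-state summary can be easier to read than the raw `squeue` listing. The sketch below is one hedged way to do that from Python: the helper names are hypothetical, and it relies on the standard `squeue` options `-h` (no header), `-u` (user) and `-o %T` (print only the job state).

```python
import getpass
import shutil
import subprocess
from collections import Counter


def count_job_states(squeue_output):
    """Count job states from `squeue -h -o %T` output (one state per line)."""
    return Counter(
        line.strip() for line in squeue_output.splitlines() if line.strip()
    )


def my_job_states(user=None):
    """Summarize one user's jobs; returns an empty Counter if squeue is absent."""
    if shutil.which("squeue") is None:
        return Counter()
    user = user or getpass.getuser()
    result = subprocess.run(
        ["squeue", "-h", "-u", user, "-o", "%T"],
        capture_output=True, text=True, check=True,
    )
    return count_job_states(result.stdout)


# Parsing demonstrated on a hand-written sample, so it also runs off-cluster.
sample = "RUNNING\nPENDING\nRUNNING\n"
print(dict(count_job_states(sample)))
```

On a cluster login node, calling `my_job_states()` returns the same kind of summary for the current user.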