ryparmar/fb-related-pages

Analysis of Facebook recommended (related) pages

This project extracts data from Facebook pages and examines their divergent attitudes towards climate change. It is part of my girlfriend's thesis, for which I helped scrape and process the data. Since the code was not expected to be reused, little care was given to its cleanliness and readability (tbh, it is ugly). 😥 The thesis is available here, unfortunately only in Czech. The English abstract is available here. In short, the thesis tries to indirectly analyze Facebook's page recommendation algorithm and its effect on the creation of information bubbles in relation to the climate crisis.

[Image: related pages]

Summary of steps

  1. For each initially selected page (there were 20 in total), all recommended pages were scraped.
  2. Step 1 was then repeated for the scraped pages, i.e., two rounds of scraping were done.
  3. Pages that were not relevant to climate change were removed, using a custom tf-idf-like score and a hand-picked threshold.
  4. The remaining climate change-related pages were manually annotated with their attitude towards climate change.
  5. A simple analysis of these annotated pages and their relationships was performed to see whether FB's recommendation algorithm helps break information bubbles.
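The relevance filter in step 3 can be illustrated with a minimal sketch. This is not the score used in the thesis (the exact formula and keyword list are not given here); it assumes a hypothetical set of climate terms, a plain term-frequency times smoothed inverse-document-frequency sum, and a threshold chosen by hand. The names `CLIMATE_TERMS`, `relevance_scores`, and `filter_relevant` are made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical keyword list; the terms actually used in the project are not stated.
CLIMATE_TERMS = {"climate", "warming", "emissions", "carbon"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def relevance_scores(page_texts, terms=CLIMATE_TERMS):
    """Return {page: score}, where score sums tf * idf over the chosen terms."""
    tokens = {page: tokenize(text) for page, text in page_texts.items()}
    n_pages = len(tokens)

    # Document frequency: number of pages that contain each term.
    df = Counter()
    for toks in tokens.values():
        for term in terms & set(toks):
            df[term] += 1

    scores = {}
    for page, toks in tokens.items():
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty pages
        scores[page] = sum(
            # Smoothed idf (always positive), an assumption of this sketch.
            (counts[t] / total) * (math.log((1 + n_pages) / (1 + df[t])) + 1)
            for t in terms
        )
    return scores


def filter_relevant(page_texts, threshold):
    """Keep pages whose score reaches the hand-picked threshold."""
    scores = relevance_scores(page_texts)
    return [page for page, score in scores.items() if score >= threshold]


# Toy input, not project data.
pages = {
    "a": "climate warming climate policy",
    "b": "football scores today",
}
scores = relevance_scores(pages)
relevant = filter_relevant(pages, threshold=0.5)
```

In practice the threshold would be tuned by inspecting borderline pages, which is presumably what "hand-picked" refers to above.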

Result

It's not that bad. FB seems to recommend pages that do not deny climate change a little more often than pages that do.
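One way such a comparison could be tallied is sketched below: count recommendation edges by the stance of the source and target page, then look at the share of edges that cross stances. The labels `"denial"` and `"non-denial"`, the function names, and the toy data are assumptions for illustration only; they are not the label set or the numbers from the thesis.

```python
from collections import Counter


def stance_transitions(edges, stance):
    """Count edges by (source stance, target stance); skip unlabeled pages."""
    counts = Counter()
    for source, target in edges:
        if source in stance and target in stance:
            counts[(stance[source], stance[target])] += 1
    return counts


def share_crossing(counts):
    """Fraction of labeled edges whose target has a different stance than the source."""
    total = sum(counts.values())
    crossing = sum(n for (src, dst), n in counts.items() if src != dst)
    return crossing / total if total else 0.0


# Toy data, made up for this example.
stance = {"p1": "denial", "p2": "denial", "p3": "non-denial", "p4": "non-denial"}
edges = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p3", "p4")]

counts = stance_transitions(edges, stance)
crossing = share_crossing(counts)
```

A higher crossing share would suggest that recommendations lead users out of their starting stance more often.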

Future work

There is plenty of room for improvement. Some simplifying assumptions were used in this project. For example, all recommended pages are treated as equivalent. However, to get to the last recommended page, the user has to click through the earlier ones. So the recommendations are clearly not equivalent and should be reweighted.
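A possible reweighting is sketched below, purely as an illustration of the idea: each recommended page gets a weight that decays geometrically with its position in the recommendation list, so pages that need more clicks to reach count less. The decay factor of 0.8 and the function names are hypothetical; nothing like this is implemented in the repository.

```python
def position_weights(n_recommended, decay=0.8):
    """Geometrically decaying weights by list position, normalized to sum to 1."""
    raw = [decay ** i for i in range(n_recommended)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_edges(source, recommended, decay=0.8):
    """Turn an ordered recommendation list into (source, target, weight) triples."""
    weights = position_weights(len(recommended), decay)
    return [(source, target, w) for target, w in zip(recommended, weights)]


# Toy example: three recommendations in display order.
triples = weighted_edges("init_page", ["first", "second", "third"], decay=0.5)
```

The decay value would need to be estimated, for example from click-through data, which this project does not have.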

Repository description:

  • data
    • init_page_label.csv: The initial 20 pages, manually annotated by their relation to climate change.
    • labeled_pages.csv: Pages recommended by FB for the pages in init_page_label, manually annotated by their relation to climate change.
    • labeled_posts.csv: Posts from the relevant pages, downloaded using CrowdTangle and labeled by their stance on climate change.
    • uniq_links1.txt: List of unique initial pages (as URLs).
    • uniq_links2.txt: List of unique pages recommended for the uniq_links1 pages (as URLs).
    • uniq_relation_data1.csv: Initial pages and their related (recommended) pages. The first column contains the initial pages and the remaining columns contain their recommended pages.
    • uniq_relation_data2.csv: Pages recommended for the initial pages, and the pages recommended for those in turn. The first column contains the pages recommended for the initial pages and the remaining columns contain the pages recommended for them. For a better understanding, see the image Sber dat below.
  • src
    • notebooks: directory with various experiments and calculations
      • 10-data-wrangling-analysis.ipynb: various data wrangling, simple analysis and csv files preparation
      • 20-page-classification.ipynb: calculation of relevancy scores and choosing only the climate change relevant pages
      • 30-visualize.ipynb: initial attempts at plotly graph visualizations
      • 3*-visualize-labeled-pages*.ipynb: final plotly graph visualizations
      • experimental-crowdtangle-posts.ipynb: experimental notebook for looking at the data downloaded from CrowdTangle
      • experimental-text-analysis.ipynb: experimental notebook, a text analysis playground (not used in the end)
    • scrapers: directory with scrapers
      • pages_content.py: script for scraping the posts for given pages
      • related_pages.py: script for scraping the relations (recommendations) between the pages
  • text: directory with the text and figures of the diploma thesis
  • interactive_graph_labeled.html: several versions of an interactive plotly graph visualization of page relations
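The relation files described above (first column is the source page, remaining columns are its recommended pages) can be turned into an edge list with a short sketch like the following. It assumes the files are comma-delimited, have no header row, and may have rows of different lengths; none of this is confirmed here, so adjust as needed. The function name `read_relations` is made up for illustration.

```python
import csv
import io


def read_relations(handle):
    """Read rows of 'source, rec1, rec2, ...' into (source, recommended) pairs."""
    edges = []
    for row in csv.reader(handle):
        # Drop empty cells, which ragged rows padded with commas would produce.
        cells = [cell.strip() for cell in row if cell.strip()]
        if not cells:
            continue
        source, targets = cells[0], cells[1:]
        edges.extend((source, target) for target in targets)
    return edges


# In-memory stand-in for a file such as data/uniq_relation_data1.csv.
sample = io.StringIO("pageA,pageB,pageC\npageD,pageE,\n")
edges = read_relations(sample)
```

With a real file, the handle would come from `open(path, newline="")`, and the resulting pairs can be fed into a graph library or the notebooks' visualizations.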

Data collection scheme

[Image: Sber dat (data collection)]
