Readme
-
You can travel and get a whole picture, like Snyder or the guy who lived in the hills.
-
We mostly use data, or some systematic information about the world.
-
Data, for most of us most of the time, sounds like a spreadsheet:
- A rectangular grid with cells, a matrix.
- A survey.
- Economic growth in countries.
- Democracy over time.
-
We use it to get something interesting like:
- Central tendencies in all the data.
- Relationship between two variables or columns.
-
Mostly statistical commands and some visualization of results are needed.
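
As a minimal sketch of these two tasks in Python, the block below computes the mean and median of one column and the Pearson correlation between two columns. The numbers and the column names `growth` and `democracy` are invented for illustration only.

```python
import statistics

# Hypothetical toy data: two columns of a tiny spreadsheet (values invented).
growth = [1.2, 2.5, 3.1, 0.8, 4.0]
democracy = [0.3, 0.5, 0.7, 0.2, 0.9]


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Central tendencies of one column, then the relationship between two columns.
print("mean growth:", statistics.mean(growth))
print("median growth:", statistics.median(growth))
print("correlation:", round(pearson(growth, democracy), 3))
```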
-
We have much more information about the world:
- We can have lots of text.
- The location of rivers next to cities.
- Audio files from parliamentary speeches.
- The human genome.
-
Knowing what can be done, beyond the spreadsheet paradigm, is important.
-
We will show some of the ways of dealing with different data.
-
We:
- Will create a community of users.
- Will have people contributing workshops.
-
The model is:
- Exposure
-
STATA: Spreadsheet-like interface, command log, statistical focus, proprietary software ($$).
-
Python (Py): A general-purpose programming language, free and open-source, with extensive libraries for statistics and data science.
-
R: Positioned between a programming language and dedicated statistical software, free and open-source.
-
Python and R are free and rely on user-added capabilities through packages and libraries.
-
All three have graphical interfaces (though some are more extensive) and can be run on your local computer.
-
In terms of jobs in the wider world (on a scale of 1 to 100):
- Python: 90
- R: 30
- STATA: 5
-
We take something in the world, look at it, and transform it into something else using our chosen tool.
-
What we take can be big or small, and you may encounter performance issues if you try to store it in your computer's memory.
-
Big-ness (the sheer volume of data) is only part of the issue.
-
What can be loaded and processed efficiently by each tool is a crucial consideration.
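
One way around the memory constraint, sketched here in Python, is to stream a file line by line rather than reading it all at once. The function name `count_words_streaming` is hypothetical, and the demo writes a small temporary file so that it runs anywhere.

```python
import os
import tempfile


def count_words_streaming(path, encoding="utf-8"):
    """Count whitespace-separated words, holding only one line in memory at a time."""
    total = 0
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:  # the file object yields one line per iteration
            total += len(line.split())
    return total


# Demo on a tiny temporary file (a stand-in for a much larger text).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("Well, Prince, so Genoa and Lucca\nare now just family estates\n")
    demo_path = tmp.name

print(count_words_streaming(demo_path))
os.remove(demo_path)
```

The trade-off is that streaming suits one-pass summaries such as counts; analyses that need the whole text at once still have to fit it in memory.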
-
For example, Leo Tolstoy's War and Peace contains approximately 587,287 words.
-
Let's load the text of War and Peace in STATA, R, and Python.
-
Before diving into specific languages, let's talk about Jupyter Notebooks.
-
Wouldn't it be nice to be able to run all kinds of code (e.g., Python, R, even potentially STATA through kernels) in a single interactive environment?
- So you get a log of what happened.
- The ability to add annotations and explanations alongside your code.
- And the output of your code (results, visualizations) displayed directly below.
-
Jupyter Notebooks allow you to do exactly this, running from an HTML page, either on your local computer or on a remote server.
- In many cases, the notebook itself can become your paper or report, integrating code, results, and narrative.
-
We will also touch upon generative AI.
-
Let's open a Python Jupyter Notebook.
-
Our goal is to count:
- The total number of words in Leo Tolstoy's War and Peace.
- How many times the main character(s) are mentioned.
-
To help us with this task, let's leverage the capabilities of a generative AI model like ChatGPT. We can ask it for Python code to:
- Read the text of War and Peace from a file.
- Split the text into individual words.
- Count the total number of words.
- Identify and count mentions of the main character(s) (we'll need to decide who those are!).
-
We will change the working directory to Downloads.
-
And make sure that the file warandpeace.txt is there.
- So, I went to ChatGPT and asked the following prompt:
"I have a text file called warandpeace.txt in my working directory. Give me Python code which counts and displays the total number of words in the file."
- And here's the kind of Python code ChatGPT might provide:
```python
try:
    # Read the whole file into one string
    with open('warandpeace.txt', 'r', encoding='utf-8') as file:
        text = file.read()
    # Split on whitespace and count the resulting pieces
    words = text.split()
    total_word_count = len(words)
    print(f"The total number of words in warandpeace.txt is: {total_word_count}")
except FileNotFoundError:
    print("Error: The file 'warandpeace.txt' was not found in the working directory.")
except Exception as e:
    print(f"An error occurred: {e}")
```
## Let's Ask ChatGPT for STATA and R Examples
- We can also ask ChatGPT how to perform similar analyses in STATA and R:
  - **STATA:** "Give me STATA code to count the number of times 'Bezukhov' and 'Rostova' appear in a text file called warandpeace.txt."
  - **R:** "Give me R code to count the number of times 'Bezukhov' and 'Rostova' appear in a text file called warandpeace.txt."
- Furthermore, we can explore more complex tasks:
  - **STATA & V-Dem:** "Give me STATA code to load the V-Dem dataset and display the number of countries classified as democracies each year over time."
  - **R & Party Data:** "Give me R code using ggplot2 to create a density plot of the ideological positions of political parties in Eastern Europe (assuming I have a dataset with party names and their ideological scores)."
- By prompting ChatGPT with specific questions related to these software packages and our analytical goals, we can get guidance on syntax, commands, and even suggestions for relevant packages or approaches.
- This highlights the potential of generative AI to assist us across different data analysis environments.
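
For comparison, the character-count request can be sketched in Python as well. This is an illustration rather than ChatGPT's output: the function name `count_mentions` is hypothetical, and matching whole words case-sensitively is an assumption about what counts as a mention.

```python
import re
from collections import Counter


def count_mentions(text, names):
    """Count whole-word, case-sensitive occurrences of each name in text."""
    counts = Counter()
    for name in names:
        pattern = r"\b" + re.escape(name) + r"\b"
        counts[name] = len(re.findall(pattern, text))
    return dict(counts)


# Small inline sample so the sketch runs without the novel on disk.
sample = "Bezukhov spoke. Rostova listened to Bezukhov, and Rostova smiled."
print(count_mentions(sample, ["Bezukhov", "Rostova"]))

# With the real file (assumed to be in the working directory):
# with open("warandpeace.txt", encoding="utf-8") as f:
#     print(count_mentions(f.read(), ["Bezukhov", "Rostova"]))
```

Note that spellings vary between translations, so which name strings to search for is itself an analytical decision.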
## Let's Recap
- **Python is very flexible:**
  - Packages like **Pandas** (data manipulation), **NumPy** (numerical computing), **spaCy** and **NLTK** (natural language processing) make statistics, web-scraping, and language analysis easier.
- **R is very versatile:**
  - Probably the **best all-around tool for creating statistical graphs and visualizations**.
  - Also functions as a programming language, allowing it to perform many of the same tasks as Python.
- **STATA excels at handling structured, rectangular datasets (like spreadsheets):**
  - It has **dedicated statistical support** and built-in commands for common econometric and statistical analyses.
- **General Trends:**
  - **Economics:** Often leans towards R and STATA.
  - **Data Science and Digital Humanities:** Frequently utilize Python.
  - **Political Science:** Has a strong tradition of using STATA.
- **But really, it is best to use all these tools when appropriate and to find ways to pass objects and data between them** to leverage their strengths.
## Other Random Things Worth Knowing
- **GitHub:** A platform for version control and collaboration on code and other files. Essential for managing projects and sharing your work.
- **Google Colab:** Runs Jupyter notebooks in the browser and lets you share them much like a Google Doc, so others can open your notebook and run it in real time.
- **LaTeX:** A powerful typesetting system widely used for creating professional-looking documents, especially those with mathematical formulas, scientific notation, and consistent formatting.
- **Surveys:** A fundamental method for collecting data about opinions, behaviors, and characteristics of a population. Understanding survey design and analysis is crucial.
- **Presentations:** Effective communication of your findings is key. Learning to create engaging and informative presentations (like this one!) is a valuable skill.
- **GIS (Geographic Information Systems):** Tools and technologies for analyzing and visualizing spatial data, such as maps, locations, and geographic features.
- **Canvas (or other Learning Management Systems):** Platforms often used for educational purposes, sharing materials, and facilitating online learning and collaboration.
- **Simple text!** Don't underestimate the power of plain text files for storing and exchanging data and information in a simple and universal format.
## Image Classification and Detection
- You can **train a model to recognize one class of images** and distinguish it from another — this is often very useful.
- Alternatively, you can **use a pre-trained model** (for example, to identify whether images contain faces, or protests).
- The model can tell whether one class of images differs from another — though it **won’t tell you why**.
- This involves **labelling** and **classification** — assigning meaning to patterns.
- Some models can **detect boundaries or shapes** — that is, identify the company that pixels keep.
- **Convolutional Neural Networks (CNNs)** are one common deep-learning model for such tasks.
- Possible application: detecting **fraudulent ballots**.
## From Project to Thesis
- A project like this can easily become a **credible MA or even BA thesis**.
- We do everything in **Python**, while showcasing opportunities to:
  - Work directly with the **file system**,
  - **Extract and save** image-based information.
- Example dataset:
  - [EP MPs Banner Images — Full Set](https://gunet-my.sharepoint.com/:f:/g/personal/panagiotis_nikolakopoulos_gu_se/EnhZA0yCpcpFnv7bmljQsZwB8owO4Z0QgNN4fAH-KEyxVQ?e=SbVMA7)
  - The accompanying file `EP_MPs_BannerImage_Log_FullSet.xlsx` describes the images
- Suggested workflow:
  - Place images in Google Drive folder: `ep_member_banner_img`
  - Resize as appropriate
  - Label faces (or not)
  - Extract features and **merge** results
## A Beginner-Friendly Primer
- For an accessible introduction to machine vision and image analysis, see:
[Seeing Like a Machine: A Beginner’s Guide to Image Analysis in Machine Learning](https://www.datacamp.com/tutorial/seeing-like-a-machine-a-beginners-guide-to-image-analysis-in-machine-learning)
- A short, practical read that complements today’s discussion nicely.
## Fingerprints of Fraud (APSR) by Francisco Cantu
- You have 50,000 tallies from voting sections, and you suspect that many are forged or fraudulent
- You can infer fraud by looking for unusual markings and deletions
- You have two options:
  - **manual:** you or an assistant spend time doing this
  - **machine learning:** teach the computer to do it
- Cantu chooses a combination: supervised (human-in-the-loop) machine learning, in which human-labelled examples teach the machine how to classify ballots
## Cantu's Classification Approach
- You take a random sample of several hundred images
- You **label** them as fraud or clean:
  - usually this means creating subfolders with **class1** and **class2** (...) images
- You invoke a machine-learning model (something like a very non-transparent regression) in Python
- You specify parameters such as:
  - how many images to train on and how many to use for validation/testing
  - the latter means the computer sets aside some labelled data (often a random 20 per cent) and does not use it when learning
  - the goal is to avoid overfitting, that is, latching onto some irrelevant feature that predicts very well in sample but makes the model less generalizable
  - there are other parameters to set, such as how many times to pass over the data (epochs) and whether to resize, crop, or convert to black and white
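
The hold-out step can be sketched in a few lines of Python. This is not Cantu's code: the function name `split_labelled`, the fixed seed, and the made-up filenames are illustrative assumptions; only the 20 per cent share comes from the text above.

```python
import random


def split_labelled(items, validation_share=0.2, seed=42):
    """Shuffle labelled items and set aside a share the model never trains on."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_validation = int(round(len(shuffled) * validation_share))
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical labelled sample: (filename, label) pairs.
labelled = [(f"tally_{i:03d}.jpg", "fraud" if i % 4 == 0 else "clean") for i in range(300)]
train, validation = split_labelled(labelled)
print(len(train), "for training;", len(validation), "held out for validation")
```

Because the held-out images play no part in training, accuracy measured on them is a fairer guide to how the model will do on the remaining, unlabelled images.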
## Cantu's Result
- The computer predicts labels for all 50,000 images
- You run a regression testing some theory of interest about which sections experience more fraud
- Main takeaway: a human can do this, but the computer is faster, so with ML we can do more research
- This is a good use of data science: theory comes first, so do not use the tools just because they seem cool, or on data that does not matter
## Types of learning
- Cantu's paper shows supervised ML
- You could run unsupervised models, in which the computer decides everything, discerning patterns for you
- The latter is problematic because human interpretation of the results tends to be post-hoc and hard to defend from a social science perspective
- Some general problems with ML models: results typically differ (somewhat) from run to run, there is no guarantee what would happen with a different sample, and choices such as why this model and why these options are hard to justify
- While **R** has some good native text-analysis tools, for images it is really **Python**
- Ethical issues appear very fast: do you have permission to use the data, are you training models to discern racial features, for what purpose, and so on
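
On the run-to-run variation mentioned above, one common mitigation is to fix the random seed. The sketch below uses a made-up stand-in for a training run (`noisy_accuracy` is hypothetical and trains nothing); it only illustrates why a seeded run can be repeated exactly while an unseeded one usually cannot.

```python
import random


def noisy_accuracy(seed=None):
    """Stand-in for a training run whose 'accuracy' depends on random draws."""
    rng = random.Random(seed)  # seed=None draws from system entropy
    return round(0.70 + rng.uniform(-0.05, 0.05), 3)


print("unseeded runs:", noisy_accuracy(), noisy_accuracy())  # usually differ
print("seeded runs:  ", noisy_accuracy(seed=1), noisy_accuracy(seed=1))  # identical
```

A fixed seed makes one run repeatable; it does not answer the question of what would happen with a different sample or model.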
## We will download images and do three things
- We will use the 500 social media banner images of current members of the European Parliament to ask:
  - Does the image contain faces?
  - What other objects are in there?
  - Do women use different images to communicate than men?
- We will use Jupyter Notebook Python code available as an online Google Colab notebook:
  - https://colab.research.google.com/drive/1PgwwMrvBnzJabIgQmuWqqTCabvEtJ_N5?usp=sharing
- You need to upload a folder named `ep_member_banner_img` to your Google Drive:
  - the source folder is `EP_MPs_BannerImages_FullSet` in GU OneDrive
  - `ep_member_banner_img` should go in the main (root) directory of Google Drive
## We can ask ChatGPT for the code
- Pose your question while explaining clearly (verbose is good), e.g.:
  - Write Python code to run in Google Colab that goes through a folder of images in my Google Drive, detects whether each image contains a human face, and prints out which files contain faces by saying filename (use filenames) contains face or does not contain face. Stop after checking 20 images to save on time. The folder in which my images sit is called ep_member_banner_img and is in the root or main directory.
- The code it returned:
  - chose OpenCV’s pre-trained face detector for us
  - loads and loops through the images, checks whether they are in fact images, keeps a counter, collects the faces found in each image into a list (one element per face), and prints "contains face" if the list has at least one element and "does not contain face" otherwise
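
The steps above can be sketched as follows. This is a reconstruction under stated assumptions, not the code ChatGPT returned: the function names are hypothetical, the Drive path in the comments assumes Colab's usual mount point, and OpenCV is imported lazily so the folder-looping logic can be tried without it.

```python
import os

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def make_opencv_face_counter():
    """Return a function that counts faces in an image file with OpenCV's Haar cascade."""
    import cv2  # imported lazily: requires the opencv-python package

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )

    def count_faces(path):
        image = cv2.imread(path)
        if image is None:  # unreadable, or not really an image
            return None
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        return len(faces)  # one bounding box per detected face

    return count_faces


def report_faces(folder, count_faces, limit=20):
    """Check up to `limit` image files and return (filename, message) pairs."""
    results = []
    for name in sorted(os.listdir(folder)):
        if len(results) >= limit:
            break
        if not name.lower().endswith(IMAGE_EXTENSIONS):
            continue
        n_faces = count_faces(os.path.join(folder, name))
        if n_faces is None:
            continue
        message = "contains face" if n_faces > 0 else "does not contain face"
        results.append((name, message))
    return results


# In Colab, after mounting Drive (the path is an assumption about the mount point):
# folder = "/content/drive/MyDrive/ep_member_banner_img"
# for name, message in report_faces(folder, make_opencv_face_counter()):
#     print(name, message)
```

Passing the detector in as a function keeps the loop independent of OpenCV, so a different face detector could be swapped in without touching the reporting code.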
## Another request to ChatGPT
- We pose the question similarly to before:
  - Write Python code to run in Google Colab that goes through a folder of images in my Google Drive, detects whether each image contains objects, and prints out a list of file name, set of objects found. I do not want too many objects. The images are the banner pictures on social media of politicians. The folder in which my images sit is called ep_member_banner_img and is in the root or main directory. You can suggest a library of package.
- The code it returned:
  - uses an AI model to recognize what’s in each image, and prints the most likely things (labels) that the model “sees”
  - imports tools: `os` helps find files in folders, `PIL.Image` opens and reads image files, and `transformers.pipeline` loads a ready-made AI model for image recognition
  - sets the folder path
  - uses Hugging Face’s `transformers` library to load a model called `google/vit-base-patch16-224` (a Vision Transformer), which was trained to recognize objects, animals, people, etc., in images
  - tries to load each image in RGB format; if an image can’t be opened (e.g., it’s corrupted), it prints a warning and moves on
  - runs the AI classifier: sends the image through the Vision Transformer model and asks for the top 5 guesses (`top_k=5`) of what’s in the picture; each guess includes a label (e.g., “person,” “dog,” “suit,” “flag”) and a confidence score (how sure the model is)
  - filters the results, keeping only the labels where the model is at least 20% confident (score > 0.2)
  - prints the filename and those labels
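
The filtering step is the part most worth seeing in code. The sketch below is an illustration, not ChatGPT's output: `keep_confident` and `classify_image` are hypothetical names, the example predictions are invented, and the real model call is left in comments because it downloads model weights.

```python
def keep_confident(predictions, threshold=0.2):
    """Keep labels from a list of {'label', 'score'} dicts whose score exceeds threshold."""
    return [p["label"] for p in predictions if p["score"] > threshold]


def classify_image(path, classifier, top_k=5):
    """Open an image as RGB, ask the classifier for its top guesses, keep confident labels."""
    from PIL import Image  # lazy import: requires the Pillow package

    try:
        image = Image.open(path).convert("RGB")
    except OSError:  # missing, corrupted, or unrecognized image file
        print(f"Warning: could not open {path}")
        return []
    return keep_confident(classifier(image, top_k=top_k))


# The predictions below are made up to show the filtering step only.
example = [
    {"label": "suit", "score": 0.61},
    {"label": "flagpole", "score": 0.24},
    {"label": "microphone", "score": 0.08},
]
print(keep_confident(example))

# With the real model (downloads weights on first use):
# from transformers import pipeline
# classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
# print(classify_image("some_banner.jpg", classifier))
```

Raising the threshold gives fewer, more reliable labels per image; lowering it gives more labels, including more doubtful ones.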
## Here we ask whether the computer can learn to classify male and female images
- We pose the question to ChatGPT:
  - Write Python code to run in Google Colab that goes through a folder of images in my Google Drive. The images are the banner pictures on social media of politicians. The folder in which my images sit is called ep_member_banner_img and is in the root or main directory. If an image name contains _f_ it is female and if _m_ it is male politician. I want you to try to learn to classify male and female images. You can suggest a library of package. I want you to look at 80 per cent of images, saving some for training/prediction. I want u to create a file with the results (describe the structure) and I want you to evaluate the accuracy
- The code it returned:
  - Mounts Google Drive to access your image folder and imports standard libraries: `torch`, `torchvision`, and `PIL` for building and training the neural network; `pandas` and `sklearn` for analyzing results and saving them neatly.
  - Defines the folder with images and where to save the output file (`predicted_labels.tsv`).
  - Defines a small “helper” class (`GenderDataset`) that finds all the .jpg, .jpeg, and .png files in the folder; reads the label from the filename (`_f_` means female, label = 0; `_m_` means male, label = 1; anything else is unknown, label = -1, and ignored during training); loads each image in color (RGB); applies optional image transformations (like resizing and normalization); and returns the image, its label, and the filename when requested. This class lets PyTorch handle images efficiently during training.
  - Sets up transformations: before being fed to the model, images are resized to 224×224 pixels, converted to a PyTorch tensor, and normalized (this just helps the model train better by scaling pixel values).
  - Creates the full dataset, keeps only images that actually have `_f_` or `_m_` in the filename, and splits those into 80% for training (to teach the model) and 20% for testing (to check how well it learned). It wraps them into “data loaders” that feed small batches (16 at a time) into the model, which makes training faster and less memory-intensive.
  - Loads a pre-trained ResNet-18 model, a popular CNN (Convolutional Neural Network) originally trained on millions of images, and sets up a loss function (CrossEntropy) to measure how wrong predictions are and an optimizer (Adam) to adjust the model’s weights. It moves everything to the GPU (cuda) if available, for speed. A third loader (`all_loader`) is used later to get predictions for all images.
  - Runs 3 full passes (epochs) through the training data. For each batch of images it sends them to the model, gets predictions, calculates the loss (how far off the guesses were), and adjusts the model to improve performance. After each epoch it prints a simple “done” message. This is the learning phase: the model gradually figures out what distinguishes male from female politicians’ images.
  - Turns off training mode and checks the model on the test set (the 20% of images it hasn’t seen). It collects predictions, compares them to the true labels, and calculates accuracy (the percentage of correct predictions) and the F1 score (a balance between precision and recall). It prints a detailed classification report showing results for each class (female/male).
  - Uses the trained model to classify every image in the folder, even those without `_f_` or `_m_` in the name. For each image it records the filename, the model’s predicted label (0 = female, 1 = male), the true label if known (or blank if not labeled), and whether that image was used during training (`seen_in_training` = 1 or 0). It then saves everything into a tab-separated file (`predicted_labels.tsv`) on Google Drive.
- This takes a few minutes to run. It shows that male politicians' images can be predicted with an F1 score of about 70 per cent, but female politicians' images less well (an F1 above 70 is generally considered not bad). More generally, the message is that political communication *is* gendered, which invites us to figure out why or how.
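
To make the labelling rule and the per-class F1 score concrete, here is a small pure-Python sketch. It is not the ChatGPT code (which relies on PyTorch and sklearn): the function names, filenames, and predictions are made up for illustration.

```python
def label_from_filename(name):
    """Return 0 for female (_f_), 1 for male (_m_), -1 if the name has neither tag."""
    if "_f_" in name:
        return 0
    if "_m_" in name:
        return 1
    return -1


def f1_for_class(true_labels, predicted_labels, target):
    """F1 score for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up filenames and predictions, for illustration only.
names = ["mep_f_001.jpg", "mep_m_002.jpg", "mep_m_003.jpg", "mep_f_004.jpg", "logo.png"]
true = [label_from_filename(n) for n in names if label_from_filename(n) != -1]
predicted = [1, 1, 1, 0]  # hypothetical model output for the four labelled images
print("F1 female:", round(f1_for_class(true, predicted, 0), 2))
print("F1 male:  ", round(f1_for_class(true, predicted, 1), 2))
```

Reporting F1 separately for each class matters here because a model that mostly guesses "male" can look accurate overall while doing poorly on female politicians' images.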