# Loading and Exploring Data

This notebook will guide you through loading and exploring some of the demonstration datasets we'll be using in this course.

> 💡 At the end, there is a short exercise for you to practice loading and exploring your own dataset.



<img src="https://github.com/annekroon/gesis-machine-learning/blob/fall-2025/pictures/lego_stack/01_preprocessing.png?raw=1" alt="Preprocessing diagram" style="max-width: 150px;">

These datasets include:

- The [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/) - a collection of 50,000 labeled IMDB reviews for binary sentiment classification.
- The [NELA-PS dataset](https://surfdrive.surf.nl/files/index.php/s/bfNFkuUVoVtiyuk) - a collection of 'pink slime' partisan news articles (from which we will use a small sample).
- The [WSJ-Reuters Financial News dataset](https://github.com/Finance-And-ML/News-Article-And-Full-Details-Dataset) - a collection of financial news from the Wall Street Journal and Reuters.
- The [fetch_20newsgroups dataset](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset) - a collection of approximately 20,000 newsgroup documents, organized into 20 different newsgroups.


In addition, you can load and explore your own dataset at the end of the notebook.

If you do not yet have a dataset, consider browsing [Kaggle](https://www.kaggle.com/datasets), [Hugging Face](https://huggingface.co/datasets), or the [Harvard Dataverse](https://dataverse.harvard.edu/) to find one that interests you.


In [None]:
import os
image_path = "../../../../pictures/lego_stack/01_preprocessing.png"
print("File exists:", os.path.exists(image_path))
print("Absolute path:", os.path.abspath(image_path))

## Setup

> 💡 We're going to start by loading in all the packages we'll need for this. This is best practice and will help keep our code organized.

In [58]:
#Let's get started by importing the necessary libraries.
#The reason that we import these at the start is so that we can see if there are any issues with our environment before we start running any code.
#...it also makes it easier to keep track of what packages we are using (and avoid loading the same package multiple times).

import os #allows python to talk to the operating system (e.g., to get filepaths).
import json #we'll need this for reading in .json files (dictionaries)
import pandas as pd #pandas will allow us to work with dataframes (tabular, or .csv-like data)
from glob import glob #glob is helpful if you want to load in multiple files at once (it constructs file paths).

#If everything imported successfully, you should see no error messages! Otherwise, you may need to install some packages (%pip install package-name).

#Let's also set a variable that points to the folder where our data is stored:
#datadir = '/Users/rupertkiddle/Desktop/teach/2025/IMLfTAwP (GESIS)/3_datasets/'

## 1. Large Movie Review Dataset (IMDB)

### loading the data

In [91]:
#first, let's define where this dataset is stored on your computer:
datadir = '/Users/rupertkiddle/Desktop/teach/2025/IMLfTAwP (GESIS)/3_datasets/aclImdb/'

In [92]:
from google.colab import files
import zipfile
import os

# Upload the zip file
uploaded = files.upload()

# Get the name of the uploaded file (the first one, if several were uploaded)
zip_path = list(uploaded.keys())[0]  # get uploaded filename
extract_dir = "/content/data"

# Extract into a folder
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

# Point datadir to where files got extracted
datadir = extract_dir
print("Extracted to:", datadir)

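# Check the nested structure
# (the loop variable is called filenames so that it does not overwrite the imported `files` module)
for root, dirs, filenames in os.walk(datadir):
    print(root, len(filenames), "files")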

Saving aclImdb_v1.tar.gz to aclImdb_v1.tar.gz


BadZipFile: File is not a zip file
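
The `BadZipFile` error above arises because the uploaded file is a gzip-compressed tar archive (`.tar.gz`), not a zip file. The next cell handles this with `tarfile`. As an alternative, here is a minimal sketch using the standard library's `shutil.unpack_archive`, which infers the archive format from the file extension. The archive and folder names (`demo.tar.gz`, `extracted`) are made up for illustration, and everything is built in a temporary directory so that the snippet runs anywhere.

```python
import os
import shutil
import tarfile
import tempfile

# Build a tiny .tar.gz archive in a temporary directory (hypothetical demo data).
workdir = tempfile.mkdtemp()
src_dir = os.path.join(workdir, "src")
os.makedirs(src_dir)
with open(os.path.join(src_dir, "review.txt"), "w", encoding="utf-8") as f:
    f.write("A great movie!")

archive_path = os.path.join(workdir, "demo.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(src_dir, arcname="demo")

# unpack_archive picks the extractor (.zip, .tar, .tar.gz, ...) from the file extension.
extract_dir = os.path.join(workdir, "extracted")
shutil.unpack_archive(archive_path, extract_dir)

extracted_files = sorted(os.listdir(os.path.join(extract_dir, "demo")))
print(extracted_files)
```

With an uploaded file, you would pass its name and your own target folder instead of the demo paths.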

In [126]:
import tarfile

tar_path = "aclImdb_v1.tar.gz"
extract_dir = "/content/data"

with tarfile.open(tar_path, "r:gz") as tar:
    tar.extractall(path=extract_dir)

print("Extracted to:", extract_dir)




Extracted to: /content/data


> 💡 If you take a look at this dataset in your file explorer, you will notice that it is made up of a lot of .txt files that are arranged in a nested folder structure, where the folders represent the labels (positive or negative sentiment) and also the data batch (train or test). In cases like this, `glob` is very useful for loading files - so we'll use it below.

In [None]:
#Now, we're going to write a loop that loads in all of the txt files in the dataset.
#Rather than point towards each individual file, we're going to use the glob package to find all of the files that match a certain pattern.

imdb_lists = [] #first we create an empty list to store our data in.

#for each file that matches the pattern (any .txt file in the pos or neg folders in train or test):
for file in glob(os.path.join(datadir, "train", "pos", "*.txt")) + \
           glob(os.path.join(datadir, "train", "neg", "*.txt")) + \
           glob(os.path.join(datadir, "test", "pos", "*.txt")) + \
           glob(os.path.join(datadir, "test", "neg", "*.txt")):
    #we're going to read the file and store its contents in a variable called 'text':
    #(notice how we are inside the 'for' loop here, as per the indentation, meaning this will iterate over every file found by glob)
    with open(file, "r", encoding="utf-8") as f:
        text = f.read()
    #we also want to extract the batch (train/test) and label (pos/neg) from the file path:
    #(we take them from the two folders directly above the file, so that a 'train' or 'pos' elsewhere in the path cannot mislead us)
    label = os.path.basename(os.path.dirname(file))
    batch = os.path.basename(os.path.dirname(os.path.dirname(file)))
    #now we append the features we just obtained, containing the batch, text, and label to our imdb_lists variable:
    imdb_lists.append([batch, text, label])

#The result is a list of lists, where each inner list contains the batch, text, and label for one review.
imdb_lists[:3] #let's look at the first three entries in our list of lists.
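
The four `glob` calls above can also be written as a single pattern, because the `*` wildcard works for folder names as well as file names. The following is a minimal sketch on a made-up miniature folder tree (the file names and review texts are invented). It filters on the label folder, since a broad pattern may also match other subfolders in a real dataset.

```python
import os
import tempfile
from glob import glob

# Build a miniature copy of the folder layout (invented file names and texts).
demo_dir = tempfile.mkdtemp()
for batch in ("train", "test"):
    for label in ("pos", "neg"):
        folder = os.path.join(demo_dir, batch, label)
        os.makedirs(folder)
        with open(os.path.join(folder, "0_1.txt"), "w", encoding="utf-8") as f:
            f.write(f"a {label} review from the {batch} batch")

rows = []
# One pattern: any batch folder / any label folder / any .txt file.
for path in sorted(glob(os.path.join(demo_dir, "*", "*", "*.txt"))):
    label = os.path.basename(os.path.dirname(path))                   # parent folder
    batch = os.path.basename(os.path.dirname(os.path.dirname(path)))  # grandparent folder
    if label not in ("pos", "neg"):
        continue  # skip any other subfolders that happen to match the pattern
    with open(path, "r", encoding="utf-8") as f:
        rows.append([batch, f.read(), label])

print(len(rows))
```

The explicit version with four `glob` calls is longer, but it makes it obvious which folders are read, which can be preferable in teaching code.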

### exploring the data (as a list)

In [None]:
#How many samples do we have in total? (hint: what is the length?)

#Write a loop or list comprehension that counts how many samples there are of each label (pos/neg) and batch (train/test).

#Write a loop or list comprehension that returns a test_list and train_list, containing only the test and train samples respectively.

#Would you want to use a tuple instead of a list? Where? Why/why not?


In [None]:
#A short introduction to the very useful zip() function:

#Sometimes, you may want to convert nested lists (like these) into separate sequences:
batches, texts, labels = zip(*imdb_lists) #the * operator 'unpacks' imdb_lists, so that zip() receives each inner list as a separate argument and regroups their items by position.

#Now you have three tuples (which behave like lists, but cannot be modified), so that you can access the batches, texts, and labels separately:
print(batches[:3]) #first three batches
print(texts[:3]) #first three texts
print(labels[:3]) #first three labels

#Putting them back together again is just as easy:
imdb_lists = list(zip(batches, texts, labels))
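
To see what `zip(*...)` does without loading the full dataset, here is a small sketch on three invented rows in the same `[batch, text, label]` layout. Note that zipping the columns back together yields tuples rather than lists.

```python
# Toy rows in the same [batch, text, label] layout (invented examples).
rows = [
    ["train", "Loved it", "pos"],
    ["train", "Hated it", "neg"],
    ["test", "Quite good", "pos"],
]

# zip(*rows) is the same as zip(rows[0], rows[1], rows[2]):
# it groups the first items together, then the second items, and so on.
batches, texts, labels = zip(*rows)
print(batches)  # ('train', 'train', 'test')
print(labels)   # ('pos', 'neg', 'pos')

# Zipping the columns back together yields tuples, not lists.
rebuilt = list(zip(batches, texts, labels))
print(rebuilt[0])  # ('train', 'Loved it', 'pos')

# Convert each tuple back to a list if you need the original structure.
rebuilt_lists = [list(row) for row in rebuilt]
```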

> 💡 With a dataset as simple as this, you would often not bother to cast it into any more complex data structures like a dictionary or a dataframe.

> However, let's at least explore how a dataframe can be useful for inspecting your data:

### exploring the data (as a dataframe)

In [None]:
#It can be useful to get a 'bird's-eye view' of the data by converting it to a dataframe:
imdb_df = pd.DataFrame(imdb_lists, columns=["batch", "text", "label"]) #we pass in the data (imdb_lists) and also the column names (batch, text, label)

#Now let's look at the first few rows of the dataframe:
imdb_df.head() #.head() shows the first five rows of the dataframe.

#NOTE: under the hood, each column of the dataframe is a pd.Series object, which is like a list but with a lot of extra functionality (all tied to Pandas).

In [None]:
#Pandas can give us some quick summary statistics about the dataframe:
imdb_df.describe() #.describe() gives us some summary statistics about the dataframe.

In [None]:
#Pandas can also tell us if there are any missing values:
imdb_df.isnull().sum() #.isnull().sum() tells us how many missing values there are in each column.
#BONUS: what happens if we just run the method .isnull() without the .sum()? Why?

In [None]:
#Pandas can even tell us the distribution of values in a column:
imdb_df['label'].value_counts()
#NOTE: notice that here we are calling .value_counts() on a specific column of the dataframe (the 'label' column).

In [None]:
#or the distribution of the length of the reviews:
imdb_df['text'].str.len().describe()
#NOTE: notice that we are chaining .str.len() and .describe() here; the .str accessor tells pandas to treat the values in the 'text' column as strings.

### summing up what we see

**IMDB Large Movie Review Dataset Overview:**

- **Total samples**: 50,000 movie reviews from IMDB
- **Labels**: Binary classification (positive/negative sentiment)
- **Data split**:
  - 🔸 Training set: 25,000 reviews
  - 🔸 Test set: 25,000 reviews
- **Class balance (i.e., distribution of labels)**:
  - ✅ Training: 12,500 positive + 12,500 negative
  - ✅ Test: 12,500 positive + 12,500 negative
- **Data quality**:
  - ✅ No missing values (confirmed via `.isnull().sum()`)
  - ✅ Minimal duplicates (compare `count` and `unique` in the `.describe()` output)

> 💡 Of course, we do not yet know how clean the text fields of the reviews are.
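
The duplicate check mentioned in the summary can also be made explicit. Below is a minimal sketch using pandas' `duplicated` and `drop_duplicates` methods on a small invented dataframe. With the real data, the same calls could be applied to `imdb_df`.

```python
import pandas as pd

# Small invented dataframe with one repeated review text.
demo_df = pd.DataFrame(
    {
        "batch": ["train", "train", "test", "test"],
        "text": ["Loved it", "Hated it", "Loved it", "Quite good"],
        "label": ["pos", "neg", "pos", "pos"],
    }
)

# duplicated() marks every repeat of an earlier value as True.
n_duplicate_texts = int(demo_df["text"].duplicated().sum())
print("Duplicate texts:", n_duplicate_texts)

# keep=False marks all copies, which is handy for inspecting them side by side.
print(demo_df[demo_df["text"].duplicated(keep=False)])

# drop_duplicates() returns a copy that keeps only the first occurrence of each text.
deduplicated = demo_df.drop_duplicates(subset="text")
print(deduplicated.shape)
```

Whether duplicates should be dropped depends on the task. A review that appears in both the train and test batch, as in this toy example, is usually worth a closer look.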

## 2. NELA-PS Dataset

### loading the data

In [None]:
#first, let's define where this dataset is stored on your computer:
datadir = '/Users/rupertkiddle/Desktop/teach/2025/IMLfTAwP (GESIS)/3_datasets/NELA-PS/'

In [59]:
from google.colab import files
import zipfile
import os

# Upload the zip file
uploaded = files.upload()

# Get the name of the uploaded file (the first one, if several were uploaded)
zip_path = list(uploaded.keys())[0]  # get uploaded filename
extract_dir = "/content/data"

# Extract into a folder
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

# Point datadir to where files got extracted
datadir = extract_dir
print("Extracted to:", datadir)

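# Check the nested structure
# (the loop variable is called filenames so that it does not overwrite the imported `files` module)
for root, dirs, filenames in os.walk(datadir):
    print(root, len(filenames), "files")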


Saving articles-small.tar.gz to articles-small.tar (1).gz


BadZipFile: File is not a zip file

In [62]:
import tarfile

tar_path = "articles-small.tar.gz"
extract_dir = "/content/data"

with tarfile.open(tar_path, "r:gz") as tar:
    tar.extractall(path=extract_dir)

print("Extracted to:", extract_dir)






Extracted to: /content/data


In [63]:
datadir = "/content/data"


> 💡 If you take a look at this dataset in your file explorer, you will see we have another `glob` situation. In this case we have articles nested within outlet-labeled folders nested within date-labeled folders. So, we can use similar logic to what we used for the IMDB dataset, to preserve these 'labels' when loading the articles.

> ⚠️ If you look a little closer, you will notice that the articles do not have file extensions. While it is convention to give files extensions (txt, csv, etc), these are not actually necessary for the computer to read them. We can just use a wildcard (*) in our glob pattern to match any file.

In [64]:
#Now, we're going to write a loop that loads in all of the article files in the dataset (remember: they have no file extension).
#Rather than point towards each individual file, we're going to use the glob package to find all of the files that match a certain pattern.

nela_lists = [] #first we create an empty list to store our data in.

#for each file that matches the pattern (any file within outlet folders within date folders):
for file in glob(os.path.join(datadir, "articles", "*", "*", "*")):
    #we're going to read the file and store its contents in a variable called 'text':
    #(notice how we are inside the 'for' loop here, as per the indentation, meaning this will iterate over every file found by glob)
    with open(file, "r", encoding="utf-8") as f:
        text = f.read()

    #we want to extract the date and outlet from the file path:
    path_parts = file.split(os.sep)  # Split the path by the OS separator
    date = path_parts[-3]  # Date folder (YYYY-MM-DD format), which is the 'third last' part of the path
    outlet = path_parts[-2]  # Outlet folder, which is the 'second last' part of the path

    #now we append the features we just obtained, containing the date, outlet, and text to our nela_lists variable:
    nela_lists.append([date, outlet, text])

#The result is a list of lists, where each inner list contains the date, outlet, and text for one article.
nela_lists[:3] #let's look at the first three entries in our list of lists.

[['2018-04-10',
  'Infowars',
  'There will be no jail time for two men involved in the brutal gang-rape of a 13-year-old girl.\n\nLike two other defendants before them Tommy Williams, 20, and Clarence Williams, 19, each received a minimum of 10 years of sex offender intensive supervised probation, with a maximum of life on probation.\n\nThe family of the young victim was outraged after learning the sentence Tuesday.\n\nThere was six of them and they gang-raped my granddaughter, said the girls grandfather, who got in a shouting match with the Williams family members and was forced to leave. He protested the plea bargain, angrily stating the men should do time.'],
 ['2018-04-10',
  'Infowars',
  'Its a typical scene at college campuses across the country  students with their noses buried in books, engaged in the pursuit of knowledge and hoping to one day make their mark on the world.\n\nIts not the first place one would imagine finding government spies, but investigative reporter Dan Go

> 💡 Great, so now we have the same data structure as we did before (a list of lists). Let's put this in a dataframe to summarize the data:

### exploring the data

In [65]:
#pass the list of lists to a pd.DataFrame():
nela_df = pd.DataFrame(nela_lists, columns=["date", "outlet", "text"]) #we pass in the data (nela_lists) and also the column names (date, outlet, text)

#Inspect the first few rows of the dataframe:
nela_df.head()

Unnamed: 0,date,outlet,text
0,2018-04-10,Infowars,There will be no jail time for two men involve...
1,2018-04-10,Infowars,Its a typical scene at college campuses across...
2,2018-04-10,Infowars,Senate Democratic Leader Charles Schumer (N.Y....
3,2018-04-10,Infowars,The special counsels office is investigating a...
4,2018-04-10,Infowars,There is much speculation as to the significan...


In [68]:
#how many samples do we have in total? (hint: a dataframe has .shape attribute that you can call)
nela_df.shape

(10000, 3)

In [69]:
#do any columns have missing data?
nela_df.isnull().sum()

Unnamed: 0,0
date,0
outlet,0
text,0


In [70]:
#let's check the date range:
nela_df['date'] = pd.to_datetime(nela_df['date']) #first we convert the 'date' column to datetime format (it is currently a string)
min_date = nela_df['date'].min() #get the minimum value of the 'date' column
max_date = nela_df['date'].max() #get the maximum value of the 'date' column
print(max_date.date())
print(min_date.date())

#BONUS: can you calculate the total date range in days?


2018-11-30
2018-02-02

In [72]:
#A quick introduction to 'F-strings' (formatted string literals) -

#Sometimes, you want to include variables inside strings, for example:
print(f"The dataset contains articles from {min_date.date()} to {max_date.date()}.")
#...the f before the string tells Python to interpret any variables inside curly braces {}.
#These are really useful for printing out informative messages that include the values of variables.

The dataset contains articles from 2018-02-02 to 2018-11-30.
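
F-strings can also format the values they insert. The sketch below uses invented numbers to show a few common format specifications, which go after a colon inside the curly braces.

```python
from datetime import date

# Invented values for illustration.
n_articles = 10000
share_missing = 0.039
first_day = date(2018, 2, 2)

# After a colon you can add a format specification:
# "," adds thousands separators, ".1%" shows a fraction as a percentage with one decimal,
# and date objects accept strftime codes such as %Y-%m-%d.
summary = f"{n_articles:,} articles, {share_missing:.1%} missing, starting {first_day:%Y-%m-%d}"
print(summary)
```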


In [73]:
#how are the articles distributed across outlets?
nela_df['outlet'].value_counts()

outlet,count
Infowars,2000
The Guardian,2000
CNN,2000
BBC,2000
Vox,2000


In [74]:
#what is the distribution of articles by date?
nela_df['date'].dt.date.value_counts().sort_index()

date,count
2018-02-02,13
2018-02-03,6
2018-02-04,9
2018-02-05,13
2018-02-06,22
...,...
2018-11-26,35
2018-11-27,27
2018-11-28,66
2018-11-29,46


### summing up what you see

**NELA-PS Dataset Overview:**

*[Enter your observations here]*


## 3. WSJ-Reuters Financial News Dataset

> 💡 Now things get a little more interesting. If you take a look at the dataset, you will find a single file in JSON format. JSON files can store all kinds of data types, but often you will find they contain dictionaries (key-value pairs). Still, the only way to know is to take a look.

### loading the data

In [None]:
#first, let's define where this dataset is stored on your computer:
datadir = '/Users/rupertkiddle/Desktop/teach/2025/IMLfTAwP (GESIS)/3_datasets/WSJ-Reuters/'

#let's use glob to get the filepath of the .json file in the directory:
file_path = glob(os.path.join(datadir, "*.json"))
file_path #this should return a list with a single filepath in it.


In [None]:
#now, let's take a peek at the first two lines of the file:
with open(file_path[0], "r", encoding="utf-8") as f:
    first_line = f.readline() #read the first line of the file
    second_line = f.readline() #read the second line of the file
print(first_line) #print the first line
print(second_line) #print the second line

#See the {key: value} pairs? This indicates that the file indeed contains data in dictionary format.
#NOTE: you could also just open the file in a text editor to peek, but it might not work if the file is very large!

In [None]:
#Great, so now we know the structure of the file: it contains one dictionary per line. Let's load it into a list of dictionaries:
data_dicts = [] #create an empty list to store the dictionaries in
with open(file_path[0], "r", encoding="utf-8") as f:
    for line in f: #for each line in the file
        data_dicts.append(json.loads(line)) #load the line as a dictionary and append it to the list

#...and let's look at the first two dictionaries:
data_dicts[:2]

### exploring the data

In [None]:
#Let's take a look at the keys in the first dictionary:
data_dicts[0].keys() #notice we are calling .keys() on the first element [0] of the list of dictionaries.

> 💡 From here, feel free to explore the data as you like. Since it appears to be 'flat' (i.e., does not have nested collections), you could convert it to a dataframe. Alternatively, you could perform analysis directly on the lists of dictionaries, or pull the values out into separate lists using a loop or list comprehension.
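
As one possible starting point for the dataframe route, here is a minimal sketch. It assumes the "one JSON dictionary per line" structure observed above. The two records are invented, and their key names are assumptions that may differ from the keys in the real file.

```python
import json
import pandas as pd

# Two invented records in "one JSON dictionary per line" form.
raw_lines = [
    '{"url": "https://example.com/a", "news_title": "Markets rally", "content": "Stocks rose."}',
    '{"url": "https://example.com/b", "news_title": "Oil slips", "content": "Crude fell."}',
]

# Parse each line into a dictionary, as in the loading loop.
records = [json.loads(line) for line in raw_lines]

# A list of flat dictionaries maps directly onto a dataframe:
# keys become column names, and each dictionary becomes one row.
news_df = pd.DataFrame(records)
print(news_df.shape)
print(news_df.columns.tolist())

# Alternatively, pull one field out into a plain list with a list comprehension.
titles = [record["news_title"] for record in records]
print(titles)
```

With the real data, `pd.DataFrame(data_dicts)` would follow the same pattern.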

In [None]:
#your code...

In [None]:
#CHALLENGE: Write a loop that creates a dict where the keys are the URLs and the values are dicts containing the other key-value pairs for that article.
#Your final dict should look like:
#{"url1": {'news_title': '...', 'news_time': '...', 'content': '...', 'keywords': '...'},
# "url2": {'news_title': '...', 'news_time': '...', 'content': '...', 'keywords': '...'}, ...}
#NOTE: This is a common structure to have if you scrape data from the web yourself - some ID as key (e.g., URL), and the other contained as k:v pairs in a nested dict.

> 💡 The true worth of the dict structure becomes apparent when you have deeply nested datasets. This is not something that we will encounter with the example datasets in this course, but if you do have such data yourself, and need some assistance wrangling it, feel free to ask!

### summing up what you see

**WSJ-Reuters Dataset Overview:**

*[Enter your observations here]*


## 4. Fetch_20newsgroups Dataset

> 💡 We will not download this dataset in the usual manner, because it is served by the sklearn library. Nowadays, several machine learning libraries include built-in datasets for convenience. For example, take a look at [sklearn datasets](https://scikit-learn.org/stable/api/sklearn.datasets.html) and [HuggingFace datasets](https://huggingface.co/docs/datasets/en/index). We can retrieve such datasets from within our code.

In [None]:
#first we need to import the function to fetch the dataset:
from sklearn.datasets import fetch_20newsgroups

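#now we can fetch the dataset (hover over the function name to see the docstring)
#NOTE: by default, only the training subset is returned; pass subset='test' or subset='all' for the other parts.
docs = fetch_20newsgroups()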

> 💡 Take a look at sklearn's guide to the [20 newsgroups dataset](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset) to get an idea of what sort of data is there and how you can pass arguments to the function above to get different parts of the dataset. Additionally, feel free to explore their [other datasets](https://scikit-learn.org/stable/api/sklearn.datasets.html) as you like.
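
The object returned by `fetch_20newsgroups` exposes the texts as `.data`, the integer labels as `.target`, and the label names as `.target_names`. Below is a sketch of how one might count documents per newsgroup from those pieces. To keep it runnable without a download, it uses a tiny invented stand-in for the real data, and `count_per_group` is a hypothetical helper name, not part of sklearn.

```python
from collections import Counter

def count_per_group(target, target_names):
    """Count documents per label name, given integer labels and their names."""
    counts = Counter(target_names[int(i)] for i in target)
    return counts.most_common()

# Tiny invented stand-in for docs.data, docs.target and docs.target_names.
demo_data = ["text about hockey", "text about space", "more hockey talk"]
demo_target = [0, 1, 0]
demo_target_names = ["rec.sport.hockey", "sci.space"]

group_counts = count_per_group(demo_target, demo_target_names)
print(group_counts)

# With the real dataset you would call:
# count_per_group(docs.target, docs.target_names)
```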


### summing up what you see

**Fetch_20newsgroups Dataset:**

*[Enter your observations here]*


## 5. Your own dataset!
> 💡 Feel free to load and play around with your own dataset. If you do not yet have one, see the notes at the top of this notebook.

## TIKTOK METADATA

In [75]:
# load data
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving scraped_data-test2.csv to scraped_data-test2.csv
User uploaded file "scraped_data-test2.csv" with length 174816 bytes


In [116]:
import pandas as pd

# Load CSV into a DataFrame
df = pd.read_csv("scraped_data-test2.csv")

# Show the first 5 rows
print(df.head())


                  date                                               link  \
0  2025-07-22 09:48:11  https://www.tiktokv.com/share/video/7505430917...   
1  2025-07-22 09:48:18  https://www.tiktokv.com/share/video/7519052633...   
2  2025-07-22 09:48:32  https://www.tiktokv.com/share/video/7526540252...   
3  2025-07-22 09:48:35  https://www.tiktokv.com/share/video/7529098443...   
4  2025-07-22 09:48:36  https://www.tiktokv.com/share/video/7517016897...   

             contentID  duration      authorID AdContent AdAccount  diggCount  \
0  7505430917360487702      59.0  7.252552e+18     False     False    71700.0   
1  7519052633512873238      47.0  7.476754e+18     False     False    36300.0   
2  7526540252249345312     269.0  7.433888e+18     False     False    31700.0   
3  7529098443956981025      15.0  6.949597e+18     False     False   151500.0   
4  7517016897272024342      44.0  7.484669e+18     False     False   108800.0   

   commentCount  playCount  shareCount           u

In [118]:
# How many rows, columns?
df.shape

(612, 13)

In [119]:
# Do any columns have missing data? Yes, many: deleted TikToks have no metadata,
# and possibly some videos have no description.
df.isnull().sum()

Unnamed: 0,0
date,0
link,0
contentID,0
duration,39
authorID,39
AdContent,39
AdAccount,39
diggCount,39
commentCount,39
playCount,39


In [120]:
#let's check the date range:
df['date'] = pd.to_datetime(df['date']) #first we convert the 'date' column to datetime format (it is currently a string)
min_date = df['date'].min() #get the minimum value of the 'date' column
max_date = df['date'].max() #get the maximum value of the 'date' column
print(max_date.date())
print(min_date.date())

# Range of dates
date_range = max_date - min_date
total_days = date_range.days
print(f"The total date range is {total_days} days.")

2025-09-05
2025-07-22
The total date range is 44 days.


In [125]:
# Define migration keywords
migration_keywords = [
    "migration", "immigration", "migrant", "immigrant",
    "refugee", "asylum", "border", "deport", "visa",
    "citizenship", "undocumented", "illegal", "migratie",
    "ilegal", "immigratie", "asiel", "vluchteling"
]

# Check descriptions for these words
# Lowercase descriptions for matching (missing descriptions become empty strings)
df["description_lower"] = df["description"].fillna("").str.lower()

# Flag if any keyword is in the description
df["mentions_migration"] = df["description_lower"].apply(
    lambda x: any(kw in x for kw in migration_keywords)
)

# Count how many descriptions mention migration
df["mentions_migration"].value_counts()


mentions_migration,count
False,583
True,29
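
One caveat about the keyword check above: `kw in x` matches substrings, so "visa" is also found inside "advisable", and "migration" inside "immigration". The following is a sketch of a stricter variant that uses a regular expression with a word boundary (`\b`) in front of each keyword. The shortened keyword list, the example sentences, and the helper name `mentions_keyword` are invented for illustration.

```python
import re

keywords = ["migration", "immigration", "migrant", "immigrant", "visa", "deport"]

# \b anchors each keyword at the start of a word, so "visa" no longer matches
# inside "advisable", while endings such as "deported" or "visas" still match.
pattern = re.compile(r"\b(?:" + "|".join(re.escape(kw) for kw in keywords) + ")", re.IGNORECASE)

def mentions_keyword(text):
    """Return True if any keyword starts a word in the text."""
    return bool(pattern.search(text))

examples = [
    "It is advisable to arrive early",  # a plain substring check would flag "visa" here
    "They were deported last year",
    "New visa rules announced",
]
flags = [mentions_keyword(text) for text in examples]
print(flags)
```

On a dataframe column, the helper could be used with `.apply(mentions_keyword)` in place of the lambda. The boundary only guards the start of a word, so a word such as "visage" would still match.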


## YOUTUBE TRANSCRIPTION DATA
Let's also load the YouTube dataset.

In [85]:
# load data
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving yt-transcript-test-batch.csv to yt-transcript-test-batch.csv
User uploaded file "yt-transcript-test-batch.csv" with length 231985 bytes


In [94]:
import pandas as pd

df2 = pd.read_csv("yt-transcript-test-batch.csv", sep=None, engine="python")
df2.head()


Unnamed: 0,﻿IDcode,transcript,error
0,DprAK933Fdg,"Hi guys, it's me again and today we're going t...",
1,ck6xGcJ3ejo,Hello noble ones. Welcome back to my channel. ...,
2,9YDzf5E6PRU,Hello and no ones. Welcome back to my channel....,
3,t-BWfG3IN5U,[Music] We ready? [Music] You know what? Chewb...,
4,pIxkM7O-CrA,um which is obviously a very relevant and and ...,
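
The first column name in the output above starts with an invisible byte order mark (BOM, `\ufeff`), which is why later cells have to refer to the column as `"\ufeffIDcode"`. Below is a minimal sketch of one way to strip it after loading, shown on an invented one-row dataframe. If you apply this to `df2`, the later cells would need to use `"IDcode"` instead.

```python
import pandas as pd

# Invented dataframe whose first column name carries a leading BOM character.
demo_df = pd.DataFrame({"\ufeffIDcode": ["abc123"], "transcript": ["Hello there."]})
print(demo_df.columns.tolist())

# Strip the BOM (if present) from every column name.
demo_df.columns = [name.lstrip("\ufeff") for name in demo_df.columns]
print(demo_df.columns.tolist())

# The column can now be accessed by its plain name.
first_id = demo_df["IDcode"].iloc[0]
print(first_id)
```

For a UTF-8 file, passing `encoding="utf-8-sig"` to `pd.read_csv` is another common way to avoid the BOM at load time.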


In [95]:
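# explore a bit
# (a notebook cell only displays the value of its last expression, so here only
# the printed .info() summary and the .tail() rows appear in the output)

df2.shape        # number of rows, columns
df2.info()       # column names + types + non-null counts
df2.head()       # first 5 rows
df2.tail()       # last 5 rows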


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ﻿IDcode     25 non-null     object 
 1   transcript  25 non-null     object 
 2   error       0 non-null      float64
dtypes: float64(1), object(2)
memory usage: 732.0+ bytes


Unnamed: 0,﻿IDcode,transcript,error
20,DmA6-pS-Dmk,Chaos and horror on a college campus. Conserva...,
21,Jy9tNyp03M0,[Music] [Applause] thank you very much Mr Pres...,
22,XfKeZruVylQ,Paul Davis has racked up a legion of online fo...,
23,XzbsCpiA5Q0,"Yes, sir. All right. Uh, is this thing on? Oh,...",
24,_uk_6vfqwTA,This video was made possible by Skillshare. Ch...,


In [96]:
# missing data?
df2.isna().sum()




Unnamed: 0,0
﻿IDcode,0
transcript,0
error,25


In [100]:
# let's look at the transcripts a bit

# Length of each transcript
df2["length"] = df2["transcript"].str.len()

# Summary stats of lengths
df2["length"].describe()


Unnamed: 0,length
count,25.0
mean,9257.32
std,7070.656704
min,119.0
25%,2653.0
50%,8978.0
75%,15288.0
max,22497.0


In [101]:
# the longest?

# Peek at longest transcript
longest = df2.loc[df2["length"].idxmax()]
print("Longest transcript ID:", longest["\ufeffIDcode"])
print("Transcript snippet:\n", longest["transcript"][:500])

Longest transcript ID: XzbsCpiA5Q0
Transcript snippet:
 Yes, sir. All right. Uh, is this thing on? Oh, I guess it is. All right. Uh, my name is Blake. [Music] [Applause] Freedom of speech. [Music] Why don't Come up. Come up to the mic. [Music] Come up to the mic. [Music] [Applause] Come on. Come on. Come up to the mic. No. Come on. Come up to the mic. USA. USA. USA. USA. USA. USA. Since you're so smart, show the world how smart you are. We'll give you our mic, not your bullhorn. So, come on up. So now we know that you're like a blatant narcissist tha


In [102]:
# Most common words (very simple split)
from collections import Counter

all_words = " ".join(df2["transcript"].dropna()).split()
Counter(all_words).most_common(20)


[('the', 1708),
 ('to', 1157),
 ('of', 965),
 ('a', 961),
 ('and', 909),
 ('that', 738),
 ('you', 700),
 ('I', 624),
 ('in', 611),
 ('is', 556),
 ('it', 339),
 ('for', 324),
 ('have', 322),
 ('this', 312),
 ('are', 268),
 ('on', 260),
 ('they', 260),
 ('we', 254),
 ('like', 253),
 ('people', 249)]

In [104]:
# cleanup
import re
from collections import Counter
from nltk.corpus import stopwords

# Download stopwords if needed
import nltk
nltk.download("stopwords")

extra_stops = {"uh", "um", "yeah", "like", "youre", "im", "dont", "thats", "well"}
stop_words = set(stopwords.words("english")).union(extra_stops)


# Join all transcripts into one text
all_text = " ".join(df2["transcript"].dropna()).lower()

# Remove punctuation & non-letters
all_text = re.sub(r"[^a-z\s]", "", all_text)

# Split into words
words = all_text.split()

# Remove stopwords
words = [w for w in words if w not in stop_words]

# Count frequencies
freq = Counter(words).most_common(20)
freq


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


[('people', 295),
 ('know', 182),
 ('think', 132),
 ('want', 132),
 ('would', 116),
 ('right', 111),
 ('one', 106),
 ('say', 104),
 ('going', 97),
 ('get', 95),
 ('theyre', 89),
 ('go', 84),
 ('even', 82),
 ('country', 75),
 ('okay', 72),
 ('see', 69),
 ('make', 69),
 ('really', 68),
 ('come', 68),
 ('us', 68)]
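
Note that "theyre" survives in the list above. Removing punctuation turns "they're" into "theyre", which no longer matches a stop word that is spelled with an apostrophe. One possible remedy is to clean the stop words with the same regular expression as the text. The sketch below uses a small invented stop word set in place of the NLTK list so that it runs without a download. `strip_non_letters` is a hypothetical helper name.

```python
import re
from collections import Counter

def strip_non_letters(text):
    """Lowercase the text and drop everything except a-z and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())

# Small invented stop word set (a stand-in for stopwords.words("english")).
raw_stop_words = {"they're", "i'm", "the", "are", "to"}

# Clean the stop words the same way as the text, so "they're" becomes "theyre".
clean_stop_words = {strip_non_letters(word) for word in raw_stop_words}

demo_text = "They're going to the border. I'm sure they're right."
tokens = [w for w in strip_non_letters(demo_text).split() if w not in clean_stop_words]
top_words = Counter(tokens).most_common()
print(top_words)
```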

In [106]:
# word pairs?
from collections import Counter
from nltk.util import ngrams

bigrams = Counter(ngrams(words, 2)).most_common(20)
bigrams


[(('star', 'wars'), 18),
 (('lot', 'people'), 16),
 (('united', 'states'), 15),
 (('would', 'say'), 14),
 (('charlie', 'kirk'), 13),
 (('thank', 'much'), 11),
 (('free', 'speech'), 11),
 (('little', 'bit'), 10),
 (('make', 'sure'), 10),
 (('people', 'come'), 10),
 (('illegal', 'immigrants'), 10),
 (('many', 'people'), 9),
 (('want', 'say'), 9),
 (('want', 'want'), 9),
 (('cyber', 'ghostvpn'), 9),
 (('political', 'violence'), 9),
 (('youve', 'got'), 8),
 (('know', 'know'), 8),
 (('law', 'enforcement'), 8),
 (('take', 'us'), 8)]
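
Because the bigrams above are computed on one long word list, a few pairs may straddle the boundary between two transcripts. Here is a sketch of counting bigrams per transcript instead, using plain `zip` rather than `nltk.util.ngrams`. The token lists are invented.

```python
from collections import Counter

# Invented, already-cleaned token lists: one list per transcript.
transcripts_tokens = [
    ["free", "speech", "matters"],
    ["star", "wars", "free", "speech"],
]

bigram_counts = Counter()
for tokens in transcripts_tokens:
    # zip(tokens, tokens[1:]) pairs each word with its successor,
    # so no pair can span two different transcripts.
    bigram_counts.update(zip(tokens, tokens[1:]))

print(bigram_counts.most_common(3))
```

With 25 transcripts the difference is small, but it grows with the number of documents.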

### Let's see if we can extract migration topics/videos

In [107]:
# Define migration keywords
migration_keywords = [
    "migration", "immigration", "migrant", "immigrant",
    "refugee", "asylum", "border", "deport", "visa",
    "citizenship", "undocumented", "illegal"
]

# Check transcripts for these words
# Lowercase transcripts for matching
df2["transcript_lower"] = df2["transcript"].str.lower()

# Flag if any keyword is in the transcript
df2["mentions_migration"] = df2["transcript_lower"].apply(
    lambda x: any(kw in x for kw in migration_keywords)
)

# Count how many transcripts mention migration
df2["mentions_migration"].value_counts()



mentions_migration,count
True,13
False,12


In [109]:
# Filter relevant transcripts
migration_transcripts = df2[df2["mentions_migration"]]
migration_transcripts[["\ufeffIDcode", "transcript"]].head()

Unnamed: 0,﻿IDcode,transcript
2,9YDzf5E6PRU,Hello and no ones. Welcome back to my channel....
4,pIxkM7O-CrA,um which is obviously a very relevant and and ...
5,jV5V8MrdfJ0,the Joe Rogan experience. That's the thing. Th...
9,G5FHblETFJw,on the Irish. We're going to get our country b...
10,giaj8hUanow,Are you ready? People are traveling from all o...


In [111]:
# Which keywords appear in the most transcripts? (each keyword is counted at most once per transcript)
from collections import Counter

found_terms = []

for text in df2.loc[df2["mentions_migration"], "transcript_lower"].dropna():
    for kw in migration_keywords:
        if kw in text:
            found_terms.append(kw)

Counter(found_terms).most_common()


[('illegal', 9),
 ('migration', 8),
 ('immigration', 8),
 ('migrant', 7),
 ('border', 7),
 ('deport', 6),
 ('immigrant', 5),
 ('asylum', 4),
 ('citizenship', 2),
 ('visa', 1),
 ('undocumented', 1),
 ('refugee', 1)]
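
The counts above are document frequencies: each keyword is counted at most once per transcript. To see how often a keyword is mentioned overall, `str.count` can be used instead. Below is a minimal sketch on two invented, lowercased transcripts that tracks both measures side by side.

```python
from collections import Counter

keywords = ["border", "asylum"]
# Invented, lowercased transcripts.
demo_transcripts = [
    "the border is closed. the border reopened later.",
    "asylum claims rose near the border.",
]

doc_frequency = Counter()   # number of transcripts containing the keyword
total_mentions = Counter()  # number of times the keyword occurs overall

for text in demo_transcripts:
    for kw in keywords:
        n = text.count(kw)  # non-overlapping substring occurrences
        if n > 0:
            doc_frequency[kw] += 1
            total_mentions[kw] += n

print(doc_frequency.most_common())
print(total_mentions.most_common())
```

Like the `in` check, `str.count` matches substrings, so overlapping keywords such as "migration" and "immigration" would both be counted for the same word.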