# TechCrunch articles analysis

> If you're running this notebook using a tool like `VS Code`, you might need to run the following 2 cells.

In [1]:
%pip install -q pymongo python-dotenv pandas plotly spacy transformers torch wordcloud matplotlib gliner ipywidgets tqdm

Note: you may need to restart the kernel to use updated packages.





In [2]:
import sys
!{sys.executable} -m spacy download en_core_web_sm

Defaulting to user installation because normal site-packages is not writeable
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')





---
> Imports

In [3]:
import matplotlib.pyplot as plt
import os
import pandas as pd
import plotly.express as px
import spacy
import tqdm

from collections import Counter
from datetime import datetime
from dotenv import load_dotenv
from gliner import GLiNER
from ipywidgets import interact
from pymongo.mongo_client import MongoClient
from pymongo.server_api import ServerApi
from transformers import pipeline
from utils import TECH_STOPWORDS, SUB_CATEGORIES
from wordcloud import WordCloud, STOPWORDS

load_dotenv()

True

> We set up the connection with the `MongoDB` cluster to read its data

In [4]:
uri = os.getenv('MONGO_PUBLIC_URI')
client = MongoClient(uri, server_api=ServerApi('1'))
db = client['tech_scraper_db']
data = db['articles']

> We count the published articles per category and per year to see which categories are the most popular through the years

In [None]:
pipeline_macro = [
    {
        '$group': {
            '_id': {
                'year': {'$year': '$timestamp'},
                'category': '$category'
                },
            'count': {'$sum': 1}
        }
    },
    {'$sort': {'_id.year': 1}}
]

data_macro = list(data.aggregate(pipeline_macro))
df_macro = pd.DataFrame([
    {'year': d['_id']['year'], 'category': d['_id']['category'], 'count': d['count']}
    for d in data_macro
])

def plot_main_categories(df, n=10):
    # Use the `df` argument (not the global `df_macro`) so the function works on any DataFrame
    top_categories = df.groupby('category')['count'].sum().nlargest(n).index
    df_plot1 = df[df['category'].isin(top_categories)]

    fig1 = px.line(df_plot1, x='year', y='count', color='category',
                title=f'top {n} techcrunch categories over time',
                template='plotly_dark', line_shape='spline')
    fig1.show()
    
plot_main_categories(df_macro)
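
> To make the aggregation easier to follow, here is a minimal sketch of what the `$group` and `$sort` stages compute, reproduced in plain Python on a few made-up documents. The `sample_docs` list is hypothetical and only mimics the `timestamp` and `category` fields used above; unlike the pipeline, this sketch also orders rows by category within a year.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample documents shaped like the `articles` collection.
sample_docs = [
    {'timestamp': datetime(2015, 3, 1), 'category': 'ai'},
    {'timestamp': datetime(2015, 7, 9), 'category': 'ai'},
    {'timestamp': datetime(2015, 5, 2), 'category': 'hardware'},
    {'timestamp': datetime(2023, 1, 15), 'category': 'ai'},
]

# Equivalent of the `$group` stage: count documents per (year, category).
counts = Counter((d['timestamp'].year, d['category']) for d in sample_docs)

# Equivalent of the `$sort` stage plus the flattening done before building the DataFrame.
rows = [
    {'year': year, 'category': category, 'count': count}
    for (year, category), count in sorted(counts.items())
]

for row in rows:
    print(row)
```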

> It seems like the `none` category represents a big part of the data we have. Looking at the numbers through the years, it seems that articles didn't really have a category field at first, and assigning one progressively became a habit.

> We can also see that the categories `hardware`, `startups` and `media & entertainment` account for the vast majority of categorized articles between 2010 and 2022. Again, this could come from the outlet's own habits: it may have had only these categories at the time and taken a while to start using the others to classify its articles efficiently.

> Using this chart we can still make some observations, such as the evolution of `AI`. We can see its beginnings around 2015 with a small number of articles. Then around 2016/17 there is a first surge following some accomplishments in the field, and then a huge spike in popularity around 2023 with the worldwide expansion and democratization of `AI` tools and applications.

> This is interesting, but it would be even more interesting to see the popularity of sub-categories within these categories. For example, in the `AI` category we could look at the main companies and models, when they appeared, and which topics of the field used to be popular but no longer are. We could also look for the main actors of the tech industry in recent years by comparing mentions of people, companies, etc.

> That is precisely what the next cell is about. We return the top N most mentioned entities for each entity label (person, company, location, genai model).

In [None]:
model = GLiNER.from_pretrained('urchade/gliner_medium-v2.1')

def get_tops(titles, n=30):
    labels = ['person', 'company', 'location', 'genai model']
    res = {}
    for t in titles:
        entities = model.predict_entities(t['title'], labels, threshold=.75)
        for ent in entities:
            if ent['label'] not in res.keys():
                res[ent['label']] = [ent['text'].strip()]
            else:
                res[ent['label']].append(ent['text'].strip())
    return {l: Counter(ents).most_common(n) for l, ents in res.items()}

tops = get_tops(data.find({'timestamp': {'$gte': datetime(2026, 1, 1)}}, {"title": 1}))

for label, top in tops.items():
    print(f'{label}: {[name for name, _ in top]}')



genai model: ['AI', 'OpenAI', 'Anthropic', 'Copilot AI', 'GPT-4o', 'Gemini', 'Bee', 'soonicorn', 'ProducerAI', 'AI agents', 'Mirai', 'AI models', 'AI Assistant', 'R2', 'Fibr AI', 'AI model', 'AI assistants', 'Melania', 'xAI', 'LLM', 'Llama', 'Kimi K2.5', 'Dojo3', 'physical AI', 'Alexa', 'Harmattan AI', 'CLOiD', 'AI-generated', 'Alpamayo']
company: ['Google', 'OpenAI', 'Amazon', 'Apple', 'Anthropic', 'Nvidia', 'Spotify', 'Tesla', 'Microsoft', 'TikTok', 'Waymo', 'SpaceX', 'xAI', 'Uber', 'YouTube', 'Netflix', 'WhatsApp', 'Meta', 'Snap', 'Stripe', 'Instagram', 'Bluesky', 'ICE', 'Discord', 'AMD', 'X', 'Snapchat', 'TechCrunch', 'Luminar', 'OpenClaw']
person: ['Elon Musk', 'Claude', 'Trump', 'Sam Altman', 'Musk', 'Siri', 'Epstein', 'CEO', 'Marquis', 'Amodei', 'Grok', 'Zuckerberg', 'VC', 'Harvey', 'Jeffrey Epstein', 'Nadella', 'Satya Nadella', 'Mark Zuckerberg', 'Palmer Luckey', 'Austin Russell', 'boss', 'Mogul', 'consultants', 'Bill Gurley', 'Susan Rice', 'Scott Rogowsky', 'Ali Partovi', 'hac
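
> The tallying step of `get_tops` can be tried without loading the model. The sketch below assumes entities are dicts with at least a `label` and a `text` key, as used in the cell above; `sample_entities` and the `tally` helper are hypothetical names introduced for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical entities in the shape consumed by get_tops (`label` and `text` keys).
sample_entities = [
    {'label': 'company', 'text': 'OpenAI '},
    {'label': 'person', 'text': 'Sam Altman'},
    {'label': 'company', 'text': 'OpenAI'},
    {'label': 'company', 'text': 'Google'},
]

def tally(entities, n=30):
    """Group entity texts by label, then keep the n most frequent per label."""
    by_label = defaultdict(list)
    for ent in entities:
        # strip() merges variants that differ only by surrounding whitespace
        by_label[ent['label']].append(ent['text'].strip())
    return {label: Counter(texts).most_common(n) for label, texts in by_label.items()}

tops_sample = tally(sample_entities)
print(tops_sample)
```

> Note that this only merges exact matches: variants such as `Musk` and `Elon Musk` are still counted separately, which is visible in the output above.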

> With that data in mind we can build a proper `SUB_CATEGORIES` mapping listing exactly what to search for. This way we can display another graph of the evolution of article categories through the years, but this time, instead of large categories, it is narrowed down to more specific topics

In [None]:
def plot_sub_categories(df, target, start=None, end=None):
    KEYWORDS = SUB_CATEGORIES.get(target, {})
    data_list = []
    for label, regex in KEYWORDS.items():
        mask = df['title'].str.contains(regex, case=False, na=False)
        yearly_counts = df[mask].groupby('year').size().reset_index(name='count')
        yearly_counts['keyword'] = label
        data_list.append(yearly_counts)
        
    if data_list:
        final_df = pd.concat(data_list)
        fig = px.line(final_df, x='year', y='count', color='keyword',
                    title=f'evolution of keywords within the {target} category',
                    template='plotly_dark', line_shape='spline')
        fig.show()
        
df_sub = pd.DataFrame(list(data.find({'timestamp': {'$lt': datetime(2026, 1, 1)}}, {'title': 1, 'timestamp': 1})))
df_sub['year'] = pd.to_datetime(df_sub['timestamp']).dt.year

plot_sub_categories(df_sub, 'ai')
plot_sub_categories(df_sub, 'people')
plot_sub_categories(df_sub, 'genai models')
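
> `SUB_CATEGORIES` is imported from `utils` and is not shown in this notebook. Judging from how `plot_sub_categories` uses it, it is presumably a mapping from a target name to `{label: regex}` pairs. The sketch below is an assumption about that shape, with made-up patterns and titles, and reproduces the per-year counting in plain Python.

```python
import re
from collections import Counter

# Hypothetical structure and patterns; the real SUB_CATEGORIES lives in utils.py.
SUB_CATEGORIES_EXAMPLE = {
    'ai': {
        'chatbots': r'\bchatbots?\b',
        'agents': r'\bagents?\b',
    },
}

# Made-up (year, title) pairs standing in for the Mongo documents.
sample_titles = [
    (2023, 'A new chatbot launches'),
    (2024, 'AI agents are everywhere'),
    (2024, 'Chatbots and agents compared'),
    (2024, 'Startup raises seed round'),
]

def count_by_year(titles, patterns):
    """Count, per (label, year), the titles matching each label's regex."""
    counts = Counter()
    for label, regex in patterns.items():
        compiled = re.compile(regex, re.IGNORECASE)  # mirrors case=False in str.contains
        for year, title in titles:
            if compiled.search(title):
                counts[(label, year)] += 1
    return counts

yearly = count_by_year(sample_titles, SUB_CATEGORIES_EXAMPLE['ai'])
print(dict(yearly))
```

> A title can match several labels, so the per-label curves are not mutually exclusive and should not be summed to get a total.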

> Next we'll display word clouds to see the keywords across all categories and get a better idea of the main actors for each year

In [8]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
def get_actors_only(text):
    doc = nlp(text)
    keywords = [token.text for token in doc if token.pos_ in ['NOUN', 'PROPN']]
    return ' '.join(keywords)

year_text = {}
for doc in data.find({}, {'title': 1, 'timestamp': 1}):
    year = doc['timestamp'].year
    year_text[year] = year_text.get(year, '') + ' ' + doc['title']
    
tech_stopwords = set(STOPWORDS)
tech_stopwords.update(TECH_STOPWORDS)

def show_wordcloud(year):
    text = get_actors_only(year_text[year])
    wc = WordCloud(
        width=800,
        height=400,
        background_color='black',
        colormap='magma',
        stopwords=tech_stopwords,
        collocations=True,
        collocation_threshold=10,
        max_words=100,
        font_path=None
    ).generate(text)
    plt.figure(figsize=(12,8))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'top trends in {year}')
    plt.show()
    
interact(show_wordcloud, year=(2010, 2026))


> Finally, we'll take a closer look at the articles without any category. We'll run a zero-shot classification model on the article titles to assign each of them one of the candidate labels listed below

> Note: This relabelling only concerns the local dataset, for curiosity and observation purposes. We won't update the labels in the `MongoDB` cluster, as this part of the project is only a bonus and should not interfere with the main part unless instructed otherwise

In [None]:
candidate_labels = ['AI', 'Media & Entertainment', 'Hardware', 'Enterprise', 'Crypto', 'Fintech', 'Security', 'Climate']
classifier = pipeline('zero-shot-classification',
                      model='valhalla/distilbart-mnli-12-3',
                      device=-1)

file_name = 'none_classified.pkl'
if os.path.exists(file_name):
    df_none = pd.read_pickle(file_name)
else:
    df_none = pd.DataFrame(list(data.find({"category": "none"}, {"title": 1, 'timestamp': 1})))
    df_none['predicted_cat'] = None
    
def run_batch(df, batch_size=1000):
    # Filter the unclassified rows first, then take the first `batch_size` of them
    pending = df[df['predicted_cat'].isna()].head(batch_size)
    
    if pending.empty:
        print('all articles classified')
        return df
    
    titles = pending['title'].tolist()
    results = classifier(titles, candidate_labels)
    
    for i, res in enumerate(results):
        df.at[pending.index[i], 'predicted_cat'] = res['labels'][0]
        
    df.to_pickle(file_name)
    print(f'processed {len(pending)} articles. saved to {file_name}')
    return df

df_none = run_batch(df_none, batch_size=100)
df_none.head()

> Classifying every article takes a long time, so the work is split into batches: each run processes one batch and saves the progress to the pickle file.

> Although we won't update the online database, we can still use the locally relabelled data to plot the main categories once more and see whether the picture changes
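
> The resumable batching can be exercised without downloading the model. In the sketch below, `fake_classifier` is a hypothetical stand-in that only mimics the part of the zero-shot output the cell relies on (a `labels` list ranked best first), and `run_batch_sketch` is a simplified variant of `run_batch` that skips the pickle file.

```python
import pandas as pd

candidate_labels = ['AI', 'Hardware']

def fake_classifier(titles, labels):
    """Hypothetical stand-in for the zero-shot pipeline: ranks 'AI' first
    when the title contains the word 'ai', otherwise 'Hardware'."""
    results = []
    for title in titles:
        best = 'AI' if 'ai' in title.lower().split() else 'Hardware'
        ranked = [best] + [label for label in labels if label != best]
        results.append({'sequence': title, 'labels': ranked})
    return results

df = pd.DataFrame({'title': ['AI beats chess', 'New phone ships', 'Chip shortage',
                             'AI in schools', 'Laptop review']})
df['predicted_cat'] = None

def run_batch_sketch(df, batch_size):
    """Classify up to `batch_size` unlabelled rows in place; return how many were done."""
    pending = df[df['predicted_cat'].isna()].head(batch_size)
    if pending.empty:
        return 0
    results = fake_classifier(pending['title'].tolist(), candidate_labels)
    df.loc[pending.index, 'predicted_cat'] = [r['labels'][0] for r in results]
    return len(pending)

# Keep calling until nothing is pending; each call picks up where the last one stopped.
processed = []
while (n := run_batch_sketch(df, batch_size=2)):
    processed.append(n)

print(processed)
print(df)
```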

In [12]:
# TODO (not implemented yet):
# 1. Load the full dataset from MongoDB.
# 2. Load the pickle file; stop here if it does not exist.
# 3. Replace each `none` category with its predicted counterpart from the pickle file, when one is set.
# 4. With the DataFrame updated, call plot_main_categories(df).
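
> One possible way to carry out the steps outlined above is sketched here on small in-memory frames rather than on the real data. `df_all` and `df_pred` are hypothetical stand-ins for the Mongo export and the pickle file, matching rows on `_id` is an assumption, and lowercasing the predicted labels to align with the existing category names is also an assumption.

```python
import pandas as pd

# Hypothetical stand-ins for the full Mongo export and the pickled predictions.
df_all = pd.DataFrame({
    '_id': [1, 2, 3, 4],
    'category': ['ai', 'none', 'none', 'hardware'],
    'year': [2020, 2020, 2021, 2021],
})
df_pred = pd.DataFrame({
    '_id': [2, 3],
    'predicted_cat': ['AI', None],  # None: this article has not been classified yet
})

# Attach predictions to the full dataset; rows without a prediction get NaN.
merged = df_all.merge(df_pred[['_id', 'predicted_cat']], on='_id', how='left')

# Only overwrite `none` categories that actually have a prediction.
use_pred = (merged['category'] == 'none') & merged['predicted_cat'].notna()
merged.loc[use_pred, 'category'] = merged.loc[use_pred, 'predicted_cat'].str.lower()

# Rebuild the year/category/count frame expected by plot_main_categories.
df_relabelled = merged.groupby(['year', 'category']).size().reset_index(name='count')
print(df_relabelled)
```

> The resulting `df_relabelled` has the same `year`, `category` and `count` columns as `df_macro`, so it could be passed to `plot_main_categories` directly.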