# Data journalism: data visualisation – implementation of interactive graphs (web enabled), infographics.

This notebook explores how sentiment and metadata from social media posts can be used to predict user engagement (likes + retweets). We also correlate trending news topics with online activity. This will help journalists find trending topics on social media and see how news coverage and online activity affect each other.

This project delivers a real-time trend forecasting web app that analyzes world-related hashtags (e.g., #Fitness, #ClimateChange, #Ukraine) in X and Reddit posts. It combines this social media data with current news headlines (via the News API). Using NLP and machine learning, it extracts trending keywords, predicts post engagement (likes, retweets, and upvotes), and forecasts topic popularity over 24–48 hours. The tool is deployed as an interactive Streamlit dashboard, offering visualizations like word clouds and trend curves. A Jupyter notebook documents the full data science workflow.


# Problem Statement: 

Trends on social media emerge and fade rapidly. Marketers, journalists, and researchers often struggle to anticipate these shifts. This project addresses that challenge by forecasting trend lifecycles, helping users optimize content timing and stay ahead of competitors.

## Objectives: 

1. To collect and preprocess real-time social media data from X (formerly Twitter) and Reddit, focusing on globally relevant hashtags (e.g., #ClimateChange, #Ukraine, #Fitness), along with current news headlines using the NewsAPI.
2. To perform sentiment analysis and keyword extraction on social media posts and news headlines using Natural Language Processing (NLP) techniques.
3. To develop predictive models that estimate user engagement, such as likes, retweets, and upvotes, based on post content, sentiment, and metadata (e.g., time posted, hashtag used).
4. To forecast the popularity of trending topics over a 24–48 hour period using time series analysis and trend modeling.
5. To analyze the correlation between news coverage and online social media activity, highlighting how news drives or reflects online trends.
6. To build and deploy an interactive Streamlit dashboard that:
   - displays real-time trends,
   - visualizes sentiment and keyword patterns (e.g., word clouds, trend curves),
   - allows journalists and users to explore topic impact and forecast engagement.
7. To document the full process in this notebook.
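Objective 5 can be illustrated with a small sketch. It assumes that hourly counts of news articles and social posts for one topic are already available as two equal-length lists; the numbers below are made up for illustration. A lagged Pearson correlation then gives a first indication of whether news activity tends to precede social activity. The helper names `pearson` and `lagged_correlation` are hypothetical and not part of the project code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(news, posts, lag):
    """Correlate news[t] with posts[t + lag], where lag is in hours (>= 0)."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(news, posts)
    return pearson(news[:-lag], posts[lag:])


# Made-up hourly counts for a single topic (illustration only).
news_per_hour = [1, 3, 8, 5, 2, 1, 1, 4]
posts_per_hour = [5, 10, 30, 80, 50, 20, 10, 10]

for lag in range(3):
    r = lagged_correlation(news_per_hour, posts_per_hour, lag)
    print(f"lag={lag}h  r={r:.2f}")
```

In this toy data the social posts follow the news by one hour, so the correlation is strongest at a lag of one. A high correlation on real data would only suggest a relationship between news and online activity; it would not show which one drives the other.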

### Libraries Needed: 



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from textblob import TextBlob
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
import praw
from datetime import datetime




# Data Collection

## Data sources: 
For social media, we use the X API [1], which lets you gather posts from X within defined parameters. We search by hashtag, since hashtags usually represent trending topics [2]. The NewsAPI [3] is used to gather news articles based on parameters taken from the X posts as well as the Reddit [4] posts. For example, if #Fitness is retrieved from the posts, it becomes the search parameter for the news articles.
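As a minimal sketch of how a hashtag could be turned into a news search term, the helper below strips the `#` and splits camel-cased hashtags into words. The function name `hashtag_to_query` is hypothetical, and the splitting rule is an assumption rather than the project's actual method.

```python
import re


def hashtag_to_query(hashtag):
    """Turn a hashtag such as '#ClimateChange' into a plain search phrase."""
    text = hashtag.strip().lstrip("#")
    # Insert a space at each lower-case-to-upper-case boundary,
    # so 'ClimateChange' becomes 'Climate Change'.
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Treat underscores as word separators as well.
    text = text.replace("_", " ")
    # Collapse any repeated whitespace.
    return " ".join(text.split())


print(hashtag_to_query("#Fitness"))        # Fitness
print(hashtag_to_query("#ClimateChange"))  # Climate Change
```

The resulting phrase could then be used as the search term for the news request.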

## *Please add other data sources here if used*


1. https://docs.x.com/x-api/introduction
2. https://www.shopify.com/nz/blog/twitter-hashtags
3. https://newsapi.org/
4. https://www.reddit.com/dev/api/

# Limitations: 

On the free tier of the X API we are entitled to 100 posts from X. Reddit allows far more, but is capped at 60 requests per minute. The combined dataset gives us roughly 600 posts to work with, with the ability to add more from Reddit when needed. This may skew the data towards Reddit posts, although we expect that averaging the sentiment scores across the whole dataset will reduce this effect.
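One simple way to stay under a per-minute request cap is to space calls evenly. The sketch below is a hypothetical helper (the `Throttle` class is not part of the project code) that waits between calls so that no more than a given number happen per minute. The clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Space out calls so that at most `max_per_minute` happen per minute."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        if max_per_minute <= 0:
            raise ValueError("max_per_minute must be positive")
        # Minimum number of seconds between two consecutive calls.
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

A caller would create `Throttle(60)` once and call `wait()` before each request. Even spacing is stricter than the cap requires, which keeps the logic simple. PRAW is documented as managing Reddit's rate limits itself, so a helper like this would mainly matter for direct HTTP requests.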

# Ethical data usage: 


### X: 
The X API can be used for a university project if it aligns with X’s License Agreement, prioritizing user privacy, transparency, and ethical data use while avoiding harmful applications like misinformation or unauthorized data scraping. Ensure compliance with platform policies and secure data handling, especially for public interest research, though access may require navigating paid tiers or specific approvals under regulations like the EU’s DSA. (https://developer.x.com/en/developer-terms/agreement-and-policy) 

### NewsAPI: 
The News API (https://newsapi.org/terms) can be ethically used for a university project by adhering to its terms, which require lawful data use, compliance with local regulations, and respecting intellectual property through proper source attribution. Ensure transparency, secure handling of the API key, and limit data use to non-commercial academic purposes within the free tier’s 500 requests/day, avoiding unauthorized redistribution of licensed content.


### Here is how the NewsAPI is used in the Streamlit app. This won't run in this notebook.


In [3]:

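import os

import pandas as pd
import streamlit as st
from newsapi import NewsApiClient  # provided by the newsapi-python package

# --- Initialize News API ---
# Read the key from an environment variable instead of hard-coding it.
# NEWSAPI_KEY is an assumed variable name; use whatever your setup defines.
API_KEY = os.environ["NEWSAPI_KEY"]
newsapi = NewsApiClient(api_key=API_KEY)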

st.title("📊 Real-Time Social + News Dashboard with Engagement Forecasting")

# --- User Topic Input ---
topic = st.text_input("Enter a topic keyword (e.g., #Fitness, climate change):", "#Fitness")

# --- News Fetching ---
if topic:
    with st.spinner("Fetching news articles..."):
        all_articles = newsapi.get_everything(
            q=topic,
            language='en',
            sort_by='publishedAt',
            page_size=10
        )
    articles = all_articles.get('articles', [])

    st.header(f"📰 Latest News on {topic}")
    if articles:
        for article in articles:
            st.subheader(article['title'])
            st.write(article['description'])
            st.markdown(f"[Read more]({article['url']})")
            st.write(f"Published at: {article['publishedAt']}")
            st.markdown("---")
    else:
        st.write("No news articles found for this topic.")

# --- Load Dataset ---
@st.cache_data
def load_social_data():
    df = pd.read_csv("data/x_posts_with_weather.csv")
    df['created_at'] = pd.to_datetime(df['created_at'], errors='coerce')
    return df

df = load_social_data()

# --- Filter Dataset ---
if topic:
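    # regex=False treats the topic as plain text, so characters such as "+" cannot break the match.
    mask = df['hashtags'].str.contains(topic.replace("#", ""), case=False, na=False, regex=False)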
    filtered_df = df[mask]

    st.header(f"📱 Social Media Posts on {topic}")
    st.write(f"Total posts found: {filtered_df.shape[0]}")

    if not filtered_df.empty:
        # Count posts per hour; lowercase 'h' is the hourly alias that newer pandas versions expect.
        st.line_chart(filtered_df.groupby(filtered_df['created_at'].dt.floor('h')).size())
    else:
        st.write("No social media posts found for this topic.")


# Reddit Data: 

Reddit data is collected with a library named 'praw', an API wrapper for Reddit that retrieves the posts. It requires a client ID, a client secret, and a user agent. The client ID and secret are issued when you create an app through the Reddit developer site, and the user agent is a descriptive string you choose.


https://praw.readthedocs.io/en/stable/

## Code Example: 

In [None]:

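import os

# Reddit API credentials, read from environment variables rather than hard-coded.
# The variable names below are assumptions; set them to match your own setup.
reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    user_agent="MyRedditSentimentApp/0.1 by noahcrampton"
)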

subreddit = reddit.subreddit("all")
posts = []

# You can use .hot(), .new(), or .top(); .new() is used here to get the most recent posts
for post in subreddit.new(limit=500):
    posts.append({
        "title": post.title,
        "score": post.score,
        "comments": post.num_comments,
        "created": datetime.utcfromtimestamp(post.created_utc),
        "url": post.url,
        "selftext": post.selftext,
        "subreddit": str(post.subreddit)
    })

df = pd.DataFrame(posts).drop_duplicates(subset="title")
df.sort_values(by="created", ascending=False, inplace=True)

print("Columns:", df.columns.tolist())

df.to_csv("data/reddit_all_recent_posts.csv", index=False)
print("Saved to data/reddit_all_recent_posts.csv")
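The collection step above pulls recent posts from r/all without any topic filter. As a minimal sketch of how those posts could be narrowed to a topic such as #Fitness, the helper below checks the title and body for the keyword. The function name `matches_topic` and the sample posts are hypothetical, and simple substring matching is an assumption rather than the project's actual method.

```python
def matches_topic(post, topic):
    """Return True if the topic keyword appears in the post's title or body."""
    keyword = topic.lstrip("#").lower()
    if not keyword:
        return False
    text = f"{post.get('title', '')} {post.get('selftext', '')}".lower()
    return keyword in text


# Made-up posts shaped like the dictionaries built in the loop above.
sample_posts = [
    {"title": "My fitness routine after 30", "selftext": "", "score": 12},
    {"title": "Election results thread", "selftext": "Discuss here.", "score": 40},
    {"title": "Morning run", "selftext": "Fitness goals for the year", "score": 7},
]

fitness_posts = [p for p in sample_posts if matches_topic(p, "#Fitness")]
print(len(fitness_posts))  # 2
```

Substring matching is deliberately crude and will also match longer words that contain the keyword. Searching within Reddit itself, for example through PRAW's `subreddit.search`, may be a better fit when only topic-specific posts are needed.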


# Data Cleaning & Preprocessing

# Exploratory Data Analysis (EDA)

# Model Development

# Model Evaluation

# Forecasting & Trend Analysis

# Streamlit App Integration

# Insights & Conclusion