# **Section 1: Introduction & Setup**

## **1.1. General Information**

*   **Notebook Objective:** Perform a *'diagnostic investigation'* of the raw dataset. We will act as *'data detectives'* to identify and document problems. This analysis will **not** clean data, but will build the evidence-based case for *why* preparation is necessary.

*   **Methodology:**
    
    *   **Data Inspection:** Use the **Pandas** library for loading and structural analysis.
    
    *   **Visualization Strategy:** **Plotly** is the primary tool. Its key advantage is **interactivity**, enabling *'dynamic exploration'* of the data (hovering for values, zooming into dense areas). **Seaborn** and **Matplotlib** are retained for high-quality static plots and low-level customizations if needed.
    
    *   **Analytical Approach:** A two-pronged investigation:
        1.  Diagnose the central **metadata file** (`News_Final.csv`).
        
        2.  Diagnose the **social feedback files**, focusing on **data fragmentation** and the *'wide format'* structural issue.

*   **Expected Outcome:** A comprehensive summary of documented data quality issues. This summary serves as the formal *'problem statement'* that justifies the work in `02_Preparation_and_Analysis.ipynb`.

## **1.2. Library Imports & Global Configuration**

In [1]:
# Core libraries for data manipulation and analysis
import pandas as pd
import numpy as np

# Libraries for data visualization
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import seaborn as sns
import matplotlib.pyplot as plt # Retained for specific low-level customizations

# Utilities for file system operations
import os
import glob


In [2]:
# --- Configuration ---
# Import and register our custom visual theme
import custom_template as ct

# Set the default template by layering our custom theme over a clean base
pio.templates["custom_template_raw"] = ct.custom_template
pio.templates["custom_template"] = "plotly_white+custom_template_raw"
pio.templates.default = "custom_template"

print("Custom Plotly template 'custom_template' has been registered and set as default.")

Custom Plotly template 'custom_template' has been registered and set as default.


# **Section 2: Diagnosing the Metadata (News_Final.csv)**

## **2.1. Initial Load & Structural Overview**

*   **General Information**
    
    *   **Objective:** Gain a high-level structural overview of the main data file.
    
    *   **Inspection Points:** DataFrame **dimensions** (rows, columns), **data types (`dtypes`)**, and **memory footprint**.
    
    *   **Goal:** Identify immediate, critical issues like *'incorrect data types'* or significant numbers of missing values.

In [3]:
# --- Section 2.1: Initial Load & Structural Overview ---

# Define the path to the metadata file
# Note: a relative path such as "Data/" keeps the notebook portable; use an absolute path if needed.
filepath = "Data/News_Final.csv"

# Load the metadata file into a pandas DataFrame
try:
    df_meta = pd.read_csv(filepath)
    print("Successfully loaded News_Final.csv.")
    print(f"Dataset dimensions: {df_meta.shape[0]} rows and {df_meta.shape[1]} columns.")
except FileNotFoundError:
    print(f"ERROR: News_Final.csv not found at the specified path: {filepath}")
    # Create an empty DataFrame to prevent subsequent cells from crashing
    df_meta = pd.DataFrame()

# Display a detailed summary of the DataFrame's structure and data types
if not df_meta.empty:
    print("\n--- DataFrame Structural Information (dtypes, non-null counts) ---")
    df_meta.info(verbose=True) # verbose=True provides detailed column info

# Display the first 8 rows to get a visual sense of the data
if not df_meta.empty:
    print("\n--- First 8 Rows of the Dataset ---")
    display(df_meta.head(8))

Successfully loaded News_Final.csv.
Dataset dimensions: 93239 rows and 11 columns.

--- DataFrame Structural Information (dtypes, non-null counts) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93239 entries, 0 to 93238
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   IDLink             93239 non-null  float64
 1   Title              93239 non-null  object 
 2   Headline           93224 non-null  object 
 3   Source             92960 non-null  object 
 4   Topic              93239 non-null  object 
 5   PublishDate        93239 non-null  object 
 6   SentimentTitle     93239 non-null  float64
 7   SentimentHeadline  93239 non-null  float64
 8   Facebook           93239 non-null  int64  
 9   GooglePlus         93239 non-null  int64  
 10  LinkedIn           93239 non-null  int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 7.8+ MB

--- First 8 Rows of the Dataset ---


|  | IDLink | Title | Headline | Source | Topic | PublishDate | SentimentTitle | SentimentHeadline | Facebook | GooglePlus | LinkedIn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 99248.0 | Obama Lays Wreath at Arlington National Cemetery | Obama Lays Wreath at Arlington National Cemete... | USA TODAY | obama | 2002-04-02 00:00:00 | 0.0 | -0.0533 | -1 | -1 | -1 |
| 1 | 10423.0 | A Look at the Health of the Chinese Economy | Tim Haywood, investment director business-unit... | Bloomberg | economy | 2008-09-20 00:00:00 | 0.208333 | -0.156386 | -1 | -1 | -1 |
| 2 | 18828.0 | Nouriel Roubini: Global Economy Not Back to 2008 | Nouriel Roubini, NYU professor and chairman at... | Bloomberg | economy | 2012-01-28 00:00:00 | -0.42521 | 0.139754 | -1 | -1 | -1 |
| 3 | 27788.0 | Finland GDP Expands In Q4 | Finland's economy expanded marginally in the t... | RTT News | economy | 2015-03-01 00:06:00 | 0.0 | 0.026064 | -1 | -1 | -1 |
| 4 | 27789.0 | Tourism, govt spending buoys Thai economy in J... | Tourism and public spending continued to boost... | The Nation - Thailand&#39;s English news | economy | 2015-03-01 00:11:00 | 0.0 | 0.141084 | -1 | -1 | -1 |
| 5 | 27790.0 | Intellitec Solutions to Host 13th Annual Sprin... | Over 100 attendees expected to see latest vers... | PRWeb | microsoft | 2015-03-01 00:19:00 | -0.075378 | 0.036773 | -1 | -1 | -1 |
| 6 | 80690.0 | Monday, 29 Feb 2016 | RAMALLAH, February 25, 2016 (WAFA) - Palestine... |  | palestine | 2016-02-28 14:03:00 | 0.0 | -0.005906 | 0 | 0 | 0 |
| 7 | 80762.0 | Obama, stars pay a musical tribute to Ray Charles | First lady Michelle Obama speaks in the State ... | Coast Reporter | obama | 2015-03-01 00:45:00 | 0.083333 | 0.103003 | -1 | -1 | -1 |


**Output Analysis**

The `News_Final.csv` metadata file was successfully loaded, containing **93,239 rows** and **11 columns**. The initial structural overview immediately reveals several critical issues that require data preparation:

*   `PublishDate` **Data Type Error:**
    *   The column is an `object` (string), not a `datetime` object.
    *   **Implication:** This makes it **unusable for any time-series analysis**, such as plotting trends or extracting features like *'hour of day'* or *'day of week'*. This is the **highest priority problem** to diagnose and fix.

*   `IDLink` **Data Type Issue:**
    *   The unique identifier is stored as `float64`, which is inappropriate for an ID.
    *   **Implication:** A float key saves no memory over `int64`, renders with a spurious trailing `.0`, and invites subtle mismatches when joined against keys stored with a different dtype. It must be converted to an `integer` or `string` for reliable joining.

*   **Missing Categorical Data:**
    *   The `Source` column is missing **279 values**, and the `Headline` column is missing **15 values**.
    *   **Implication:** These missing entries must be handled (e.g., filled with a placeholder like *'Unknown'* or analyzed for removal) to ensure data integrity in subsequent grouping operations.

*   **Data Quality Concerns:**
    *   A visual inspection of the output (e.g., `IDLink` 80690) shows potential data entry errors, such as a date appearing in the `Title` column and a blank `Source`.
    *   **Implication:** This suggests the presence of noisy, inconsistent data beyond the standard `NaN` values, reinforcing the need for a thorough cleaning phase.
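
The dtype findings above can also be quantified without modifying `df_meta`. The following is a minimal diagnostic sketch on a toy frame (the values, including the deliberately broken date, are illustrative assumptions and not taken from the dataset). It counts how many `PublishDate` strings would fail to parse and checks whether every `IDLink` is a whole number, which is the precondition for a safe integer conversion later.

```python
import pandas as pd

# Toy stand-in for df_meta; the rows are illustrative only
toy = pd.DataFrame({
    "IDLink": [99248.0, 10423.0, 18828.0],
    "PublishDate": ["2002-04-02 00:00:00", "2008-09-20 00:00:00", "not a date"],
})

# Is the column already a datetime type? (Expected: no, it is a plain string column)
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["PublishDate"])

# Parse into a separate Series; unparseable strings become NaT instead of raising
parsed = pd.to_datetime(toy["PublishDate"], errors="coerce")
n_unparseable = int(parsed.isna().sum() - toy["PublishDate"].isna().sum())

# An ID column is safe to cast to integer only if no value has a fractional part
ids_are_whole = bool((toy["IDLink"].dropna() % 1 == 0).all())

print(f"Already datetime: {is_datetime}")
print(f"Unparseable dates: {n_unparseable}")
print(f"All IDs are whole numbers: {ids_are_whole}")
```

Running the same three checks against the real `df_meta` would turn the qualitative observations into counts that can be carried into the problem statement.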

## **2.2. Popularity Columns: Placeholders and Skewness**

*   **General Information**
    
    *   **Objective:** Validate the contents of the primary metric columns: `Facebook`, `GooglePlus`, and `LinkedIn`.
    
    *   **Methodology:** Use **descriptive statistics** (`.describe()`) to inspect the range, central tendency, and spread of these columns.
    
    *   **Goal:** Uncover anomalies, such as placeholder values, that would invalidate standard statistical calculations.


In [4]:
# --- Section 2.2: Diagnosing Popularity Columns ---

if not df_meta.empty:
    # Select only the popularity columns for a focused summary
    popularity_columns = ['Facebook', 'GooglePlus', 'LinkedIn']
    
    # Generate and display descriptive statistics
    print("--- Descriptive Statistics for Popularity Columns ---")
    display(df_meta[popularity_columns].describe())

--- Descriptive Statistics for Popularity Columns ---


|  | Facebook | GooglePlus | LinkedIn |
|---|---|---|---|
| count | 93239.0 | 93239.0 | 93239.0 |
| mean | 113.141336 | 3.888362 | 16.547957 |
| std | 620.173233 | 18.492648 | 154.459048 |
| min | -1.0 | -1.0 | -1.0 |
| 25% | 0.0 | 0.0 | 0.0 |
| 50% | 5.0 | 0.0 | 0.0 |
| 75% | 33.0 | 2.0 | 4.0 |
| max | 49211.0 | 1267.0 | 20341.0 |


**Output Analysis**

The descriptive statistics for the popularity columns reveal two major, intertwined problems that are **systemic across all three platforms**, rendering the raw data misleading for analysis:

*   **Contamination by Placeholder Values:**
    *   The **`min` value** for `Facebook`, `GooglePlus`, and `LinkedIn` is consistently **-1**.
    *   **Implication:** Because share counts cannot be negative, `-1` is almost certainly a placeholder for untracked or missing data rather than a real score. Its presence **distorts every statistical measure** (mean, std, quartiles) for *every platform*, so the figures above cannot be used for direct comparison until the placeholder is handled.

*   **Extreme Right-Skew and Outliers:**
    *   A massive disparity exists between the **75th percentile** and the **`max` value** across all platforms. For instance, 75% of articles on `Facebook` have 33 shares or fewer, yet the maximum is 49,211. Similarly, for `LinkedIn`, the 75th percentile is 4 while the maximum is 20,341.
    *   **Implication:** The data for each platform is dominated by a few "viral" articles. This heavy **right-skew** means the **`mean` is not a reliable measure of central tendency** for any of the platforms, as it's heavily inflated by these outliers. For any effective analysis or visualization, a transformation (such as a **logarithmic transformation**) will be essential.
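
To make the size of the distortion concrete, the sketch below compares statistics with and without the placeholder on a toy column. The six values are illustrative assumptions chosen to echo the quartiles above, and treating every negative value as "untracked" is itself an assumption. Nothing is overwritten: the masked and log-transformed values live in separate Series.

```python
import numpy as np
import pandas as pd

# Toy column; values are illustrative, not the real distribution
toy = pd.DataFrame({"Facebook": [-1, -1, 0, 5, 33, 49211]})

# Share of rows carrying the placeholder
placeholder_share = (toy["Facebook"] == -1).mean()

# Mask the placeholder as NaN in a separate Series (the original stays untouched)
valid = toy["Facebook"].where(toy["Facebook"] >= 0)

raw_mean, valid_mean = toy["Facebook"].mean(), valid.mean()
raw_median, valid_median = toy["Facebook"].median(), valid.median()

# log1p compresses the long tail and is defined at zero
log_scores = np.log1p(valid)

print(f"Placeholder share: {placeholder_share:.1%}")
print(f"Mean   raw={raw_mean:.2f}  valid-only={valid_mean:.2f}")
print(f"Median raw={raw_median}  valid-only={valid_median}")
print(f"Max on log1p scale: {log_scores.max():.2f}")
```

The gap between mean and median in either version illustrates the skew, while the shift between the raw and valid-only columns illustrates the placeholder effect.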


**Visualization**

*   **Objective:** To create a consolidated figure with three subplots that visually proves the two primary issues—placeholder contamination and extreme skew—are systemic problems across all three platforms.

*   **Methodology:** Use Plotly's `make_subplots` function to create a 1x3 grid of histograms, one for each popularity column (`Facebook`, `GooglePlus`, `LinkedIn`). This side-by-side comparison is a powerful technique for highlighting shared patterns in different data series.

*   **Expected Outcome:** A single, powerful figure that serves as undeniable evidence that the popularity data for all platforms is unsuitable for analysis in its raw state.

In [5]:
# --- Section 2.2: Visualization of All Popularity Columns ---

if not df_meta.empty:
    # Pass a pre-configured figure so the subplots inherit the default template
    fig = make_subplots(
        rows=1,
        cols=3,
        subplot_titles=("<b>Facebook Popularity</b>", "<b>GooglePlus Popularity</b>", "<b>LinkedIn Popularity</b>"),
        figure=go.Figure(layout=go.Layout(template=pio.templates.default)),
    )

    # --- Add Facebook Histogram ---
    fig.add_trace(
        go.Histogram(
            x=df_meta["Facebook"],
            nbinsx=100,
            name="Facebook",
        ),
        row=1,
        col=1,
    )

    # --- Add GooglePlus Histogram ---
    fig.add_trace(
        go.Histogram(
            x=df_meta["GooglePlus"],
            nbinsx=100,
            name="GooglePlus",
        ),
        row=1,
        col=2,
    )

    # --- Add LinkedIn Histogram ---
    fig.add_trace(
        go.Histogram(
            x=df_meta["LinkedIn"],
            nbinsx=100,
            name="LinkedIn",
        ),
        row=1,
        col=3,
    )

    # --- Update Layout and Titles ---
    fig.update_layout(
        # Use a title dictionary for precise control over position and font
        title=dict(
            text="<b>Distribution of Raw Popularity Scores Across Platforms</b><br><span style='font-size:18px;'>Visual evidence of '-1' placeholders and extreme right-skew</span>",
            font=dict(size=24),
            y=0.90,  # Adjust vertical position (1 is top, 0 is bottom)
            yanchor="top",
        ),
        showlegend=False,
        height=500,
        # Large top margin gives the two-line title enough space
        margin=dict(t=150, b=80, l=80, r=50),
    )

    # Adjust font size for subplot titles
    for annotation in fig["layout"]["annotations"]:
        annotation["font"]["size"] = 16

    # Update axis labels for all subplots
    fig.update_xaxes(
        title_text="<b><i>Number of Shares (Raw)</i></b>", title_font=dict(size=16), tickfont=dict(size=14), row=1
    )
    fig.update_yaxes(title_text="<b><i>Count of Articles</i></b>", row=1, col=1)
    fig.update_yaxes(title_font=dict(size=16), tickfont=dict(size=14), row=1, range=[0, 95000])

    fig.show()

The visualization provides direct visual evidence for the two critical issues identified in the statistical analysis.

*   **Visual Confirmation of Placeholder & Low-Engagement Dominance:**
    
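    *   In all three subplots, the distribution is dominated by a single, **massive bar near the origin**. Because each histogram is capped at about 100 bins over the full range, this first bar contains the tens of thousands of articles with untracked (`-1`) or zero (`0`) popularity scores, together with any low positive counts that fall into the same wide bin.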
    
    *   **Implication:** The immense scale of this single bar **completely flattens and obscures the distribution** of all other articles that received genuine engagement, making the chart nearly unreadable for understanding the spread of popular articles.

*   **Visual Confirmation of Extreme Right-Skew (The "Long Tail"):**
    
    *   For each platform, the x-axis extends far to the right (e.g., beyond 40k for Facebook), but the corresponding bars for these high values are so small they are **practically invisible**.
    
    *   **Implication:** This is the classic visual representation of a dataset with extreme outliers or a **"long tail"**. It proves that a tiny number of "viral" articles have scores thousands of times higher than the median article. This visual skew confirms that raw statistical measures like the `mean` are misleading and that the data **requires a transformation** (e.g., logarithmic) to be analyzed and visualized effectively.
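
Since the histogram cannot separate `-1` from `0` visually, a small tabular breakdown is a useful companion to the figure. The sketch below runs on a toy frame (all values are illustrative assumptions) and counts, per platform, how many rows are untracked, zero, or positive.

```python
import pandas as pd

# Toy stand-in for the three popularity columns; values are illustrative
toy = pd.DataFrame({
    "Facebook":   [-1, 0, 0, 5, 33, 49211],
    "GooglePlus": [-1, -1, 0, 0, 2, 1267],
    "LinkedIn":   [-1, 0, 0, 0, 4, 20341],
})

# One row per platform, one column per category
breakdown = pd.DataFrame({
    "untracked (-1)": (toy == -1).sum(),
    "zero": (toy == 0).sum(),
    "positive": (toy > 0).sum(),
})

# Fraction of rows that sit at or below zero
breakdown["share_at_or_below_zero"] = (toy <= 0).mean()

print(breakdown)
```

Applied to `df_meta[popularity_columns]`, the same three comparisons would show exactly how much of the bar at the origin is placeholder data and how much is genuine zero engagement.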

## **2.3. Categorical Columns: Inconsistency and Granularity**

*   **General Information**
    
    *   **Objective:** Diagnose the quality and analytical readiness of all non-numeric columns in the metadata file.
    
    *   **Methodology:** Classify each column by its data type and apply the appropriate diagnostic technique.
        *   **Low-Cardinality Categorical (`Topic`):** Assess distribution and usability.
        
        *   **High-Cardinality Categorical (`Source`):** Quantify the *'long tail problem'*.
        
        *   **Unstructured Text (`Title`, `Headline`):** Assess uniqueness and basic integrity (missing values).
    
    *   **Goal:** Formally justify the specific data preparation steps needed for each type of non-numeric data.


In [6]:
# --- Section 2.3: Overall Cardinality of Object-Type Columns ---
    
if not df_meta.empty:
    # Get all columns with dtype 'object'
    object_columns = df_meta.select_dtypes(include=['object']).columns
    
    # Calculate cardinality for each
    cardinality = df_meta[object_columns].nunique()
    
    # Create a summary DataFrame
    cardinality_df = pd.DataFrame({
        'Column': cardinality.index,
        'Unique_Values': cardinality.values
    })
    
    print("--- Cardinality Report for Non-Numeric Columns ---")
    display(cardinality_df)

--- Cardinality Report for Non-Numeric Columns ---


|  | Column | Unique_Values |
|---|---|---|
| 0 | Title | 81259 |
| 1 | Headline | 86694 |
| 2 | Source | 5756 |
| 3 | Topic | 4 |
| 4 | PublishDate | 82644 |


**Output Analysis**

*   `Topic` shows a very low number of unique values => **simple categorical feature**

*   `Source` shows a very high number => **high-cardinality feature requiring cleaning**

*   `Title` and `Headline` show unique values approaching the total number of rows => **unstructured text, not suitable for direct grouping**

*   `PublishDate` also shows high uniqueness => **consistent with its nature as a timestamp**

### **2.3.1. Low-Cardinality Categorical (`Topic`)**

*   Characterized by a **small, manageable number of unique values**.

    => Confirm the feature is **clean and well-distributed, ready for grouping and comparative analysis**.

In [7]:
# --- Section 2.3.1: Diagnosing the 'Topic' Column ---

if not df_meta.empty:
    # 1. Inspect the value distribution for the 'Topic' column
    print("--- Value Distribution for 'Topic' Column ---")
    topic_value_counts = df_meta['Topic'].value_counts()
    display(topic_value_counts)

    # 2. Check for any standard missing values (NaN)
    missing_topics = df_meta['Topic'].isna().sum()
    print(f"\nNumber of missing values in 'Topic' column: {missing_topics}")

--- Value Distribution for 'Topic' Column ---


Topic
economy      33928
obama        28610
microsoft    21858
palestine     8843
Name: count, dtype: int64


Number of missing values in 'Topic' column: 0


In [8]:
# --- Visualization 2.3.1: 'Topic' Column Distribution ---

if not df_meta.empty:
    topic_counts = df_meta["Topic"].value_counts().reset_index()
    topic_counts.columns = ["Topic", "Count"]
    topic_counts = topic_counts.sort_values("Count", ascending=False)

    fig_topic = go.Figure(layout=go.Layout(template=pio.templates.default))

    fig_topic.add_trace(
        go.Bar(
            x=topic_counts["Topic"],
            y=topic_counts["Count"],
            marker_color=topic_counts["Count"],
            marker_colorscale=[[0, "#ffcd90"], [1, "#ff7b00"]],
            text=topic_counts["Count"],
            texttemplate="<b>%{text:.3s}</b>",
            textposition="outside",
            textfont=dict(size=16),  # Font size for the bar value labels
        )
    )

    # Update the layout with new title font sizes
    fig_topic.update_layout(
        # Use a dict for precise control over main title and subtitle font sizes
        title=dict(
            text="<b>Distribution of Articles by Topic</b><br><span style='font-size:18px;'>A clean, low-cardinality feature</span>",
            font=dict(size=24),
        ),
        xaxis={"categoryorder": "total descending"},
        coloraxis_showscale=False,
    )

    # --- Update axis fonts ---
    fig_topic.update_xaxes(title_text="<b><i>News Topic</i></b>", title_font=dict(size=18), tickfont=dict(size=16))
    fig_topic.update_yaxes(
        title_text="<b><i>Number of Articles</i></b>",
        title_font=dict(size=18),
        tickfont=dict(size=16),
        range=[0, topic_counts["Count"].max() * 1.15],  # Increased headroom slightly for larger font
    )

    fig_topic.show()


**Analysis of Code and Visualization**

*   **Low Cardinality Confirmed:**
    *   Code output shows **4 distinct values**, a very low and manageable number.
    
    *   Bar chart visually confirms this, displaying exactly four clean categories.

*   **Well-Distributed and Complete:**
    *   Value counts and bar heights show all topics have a substantial number of articles (ranging from 8.84k to 33.9k). There are **no rare or insignificant categories**.
    
    *   The code confirms **zero missing values**, indicating a complete feature.

==> `Topic` column is a **high-quality, analysis-ready feature**. It requires **no data preparation** and is a primary candidate for grouping and segmentation in our final story.

### **2.3.2. High-Cardinality Categorical (`Source`)**

*   Characterized by a **large number of unique values**, often with a **highly skewed distribution**.

    => Prove that the feature is **too noisy and granular for direct use**. Justify the need for a preparation step *(grouping or standardization)* to reduce its complexity.

In [9]:
# --- Section 2.3.2: Diagnosing the 'Source' Column ---

if not df_meta.empty:
    # 1. Inspect the value distribution (Top 20 and Bottom 20)
    print("--- Top 20 Most Frequent Sources ---")
    display(df_meta['Source'].value_counts().head(20))
    
    print("\n--- 20 Least Frequent Sources ---")
    display(df_meta['Source'].value_counts().tail(20))

    # 2. Check for standard missing values (NaN)
    missing_sources = df_meta['Source'].isna().sum()
    print(f"\nNumber of missing values in 'Source' column: {missing_sources}")

--- Top 20 Most Frequent Sources ---


Source
Bloomberg                     1732
Reuters                       1321
ABC News                      1098
New York Times                 992
The Guardian                   933
Business Insider               884
Economic Times                 787
Forbes                         781
Washington Post                774
CNN                            742
Wall Street Journal            730
WinBeta                        707
CNBC                           682
Huffington Post                676
Breitbart News                 605
Reuters via Yahoo! Finance     551
The Hill                       548
Financial Times                544
USA TODAY                      530
ZDNet                          526
Name: count, dtype: int64


--- 20 Least Frequent Sources ---


Source
Crain's Chicago Business (blog)                                   1
Design Week                                                       1
Hypergrid Business                                                1
IT Business Edge                                                  1
Pharmaceutical Executive (press release) (registration) (blog)    1
OK! Magazine                                                      1
KTVQ Billings News                                                1
Kasmir Monitor                                                    1
Essex Chronicle                                                   1
TwoCircles.net                                                    1
Around the Rings (subscription)                                   1
TG Daily                                                          1
Lifestyles                                                        1
Ministry of Foreign Affairs of Denmark                            1
Seeker                                   


Number of missing values in 'Source' column: 279


In [None]:
# --- Visualization 2.3.2: 'Source' Column Distribution (Top 20) ---

if not df_meta.empty:
    source_counts = df_meta["Source"].value_counts().head(20).reset_index()
    source_counts.columns = ["Source", "Count"]
    source_counts = source_counts.sort_values("Count", ascending=True)

    total_sources = df_meta["Source"].nunique()

    fig_source = go.Figure(layout=go.Layout(template=pio.templates.default))

    fig_source.add_trace(
        go.Bar(
            x=source_counts["Count"],
            y=source_counts["Source"],
            orientation="h",
            marker_color=source_counts["Count"],
            marker_colorscale=[[0, "#eedee1"], [0.3, "#ff98b7"], [0.6, "#ff4d6e"], [1, "#ff1d61"]]
        )
    )

    # Update the layout with new title font sizes
    fig_source.update_layout(
        # Use a dict for precise control over main title and subtitle font sizes
        title=dict(
            text=f"<b>Distribution of Top 20 News Sources (out of {total_sources})</b><br><span style='font-size:18px;'>Visual evidence of the 'long tail problem'</span>",
            font=dict(size=24),
        ),
        bargap=0.4,
        coloraxis_showscale=False,
        height=750,  # Set a fixed height to ensure labels don't overlap
        margin=dict(l=200),  # Ensure enough left margin for long source names
    )

    # --- Update axis fonts ---
    fig_source.update_xaxes(
        title_text="<b><i>Number of Articles</i></b>",
        title_font=dict(size=18),
        tickfont=dict(size=14),
    )
    fig_source.update_yaxes(
        title_text="<b><i>News Source</i></b>",
        title_font=dict(size=18),
        tickfont=dict(size=14),
        tickvals=source_counts["Source"],  # Force every source label to be shown
        ticktext=source_counts["Source"],
    )

    fig_source.show()


**Analysis of Code and Visualization**

*   **Extreme Cardinality Confirmed:**
    *   Total unique value count is **5,756** => immediate **red flag** for a categorical feature.

*   **"Long Tail" Distribution Proven:**
    
    *   Code output shows the top sources (`Bloomberg`, `Reuters`) have **over 1,000 articles** each, while the least frequent sources appear only **once**.
    
    *   Bar chart visually confirms this **steep drop-off in frequency**, showing a *high concentration of articles among a few publishers*. The title explicitly notes these are only the Top 20 of over 5,700 sources.

*   **Missing Values Identified:**
    *   The code confirms **279 missing `Source` values**, adding another layer of *required cleaning*.

==> `Source` column is **too granular and noisy for effective analysis** in its raw state. Any attempt to group by this feature would create thousands of categories, making visualizations unreadable and insights meaningless. A **mandatory data preparation step** is required to **consolidate the long tail of infrequent sources into an 'Other' category, handle missing values, or standardize source names** *(e.g., merging 'Reuters' and 'Reuters via Yahoo! Finance')*.
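
The long tail can be expressed as two numbers: the fraction of sources that appear exactly once, and the fraction of articles covered by the top few sources. The sketch below computes both on a toy Series. The counts are illustrative assumptions (only the source names are borrowed from the output above), and the cutoff of two "top" sources is arbitrary.

```python
import pandas as pd

# Toy 'Source' column with one missing value; counts are illustrative
toy_sources = pd.Series(
    ["Bloomberg"] * 5 + ["Reuters"] * 3 + ["Design Week", "TG Daily", None],
    name="Source",
)

counts = toy_sources.value_counts()  # NaN is excluded by default

# Fraction of distinct sources that occur exactly once
singleton_share = (counts == 1).mean()

# Fraction of (non-missing) articles published by the top 2 sources
top2_coverage = counts.head(2).sum() / counts.sum()

print(f"Distinct sources: {len(counts)}")
print(f"Sources seen only once: {singleton_share:.0%}")
print(f"Articles covered by top 2 sources: {top2_coverage:.0%}")
```

On the real column, replacing `head(2)` with a larger cutoff would indicate how many sources are worth keeping as their own category before the rest are consolidated.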

### **2.3.3. Unstructured Text (`Title`, `Headline`)**

*   **Characteristics:** These columns contain **unique or near-unique string values**. While they are not categorical, they hold high-density information regarding the article's **tone, complexity, and emotional appeal**.

*   **Diagnostic Goal:**
    1.  Assess basic data integrity (missing values).
    
    2.  **Qualitative Inspection:** Determine if the text content is rich enough to support **Natural Language Processing (NLP)**. We need to verify if the headlines contain extractable signals (e.g., sentiment, subjectivity) to answer our **"Content DNA"** strategic questions.

In [11]:
# --- Section 2.3.3: Diagnosing Unstructured Text Columns ---
if not df_meta.empty:
    text_cols = ['Title', 'Headline']
    
    # 1. Quantitative: Integrity Check
    # Calculate the percentage of unique values
    uniqueness_ratio = (df_meta[text_cols].nunique() / len(df_meta)) * 100
    # Calculate missing value counts
    missing_counts = df_meta[text_cols].isna().sum()
    missing_ratio = (missing_counts / len(df_meta)) * 100

    # Create a summary DataFrame
    text_summary_df = pd.DataFrame({
        'Uniqueness (%)': uniqueness_ratio.round(2),
        'Missing Values': missing_counts,
        'Missing (%)': missing_ratio.round(2)
    })

    print("--- Quantitative Summary (Integrity) ---")
    display(text_summary_df)

    # 2. Qualitative: Content Sampling
    # Display random samples to inspect the "nature" of the language used
    print("\n--- Qualitative Inspection (Random Sample of Content) ---")
    pd.set_option('display.max_colwidth', 150) # Expand width to read full text
    display(df_meta[text_cols].sample(5, random_state=42))
    pd.reset_option('display.max_colwidth')

--- Quantitative Summary (Integrity) ---


|  | Uniqueness (%) | Missing Values | Missing (%) |
|---|---|---|---|
| Title | 87.15 | 0 | 0.0 |
| Headline | 92.98 | 15 | 0.02 |



--- Qualitative Inspection (Random Sample of Content) ---


|  | Title | Headline |
|---|---|---|
| 17026 | Obama, Nerd President, Needs to Get Right With 'Star Wars' Fans | President Barack Obama knocked off work on Friday afternoon to screen the new Star Wars film """The Force Awakens""" at the White House |
| 70005 | Johannesburg committed to developing job intensive economy, says ... | THE City of Johannesburg has committed itself to developing a resilient and job intensive economy, executive mayor Parks Tau said on |
| 22941 | Whose President Was He? | """If I spent all my time thinking about it, I'd be paralyzed,""" Barack Obama told me. """And frankly, the voters would justifiably say, 'I need ... |
| 66136 | Argentina pays off 'holdout' bondholders, elevating hopes for economy | In recent years, soybean farmer Mario Caceres had to pay interest rates of up to 50% on the bank loans he needed to buy planting equipment, |
| 78589 | Obama: Trump displays ignorance, seeks tweets over solutions | U.S. President Barack Obama disparaged U.S. Republican presidential candidate Donald Trump, saying the billionaire seeks tweets over |


**Analysis of Unstructured Text Diagnostics**

**1. High Uniqueness & Data Integrity**

*   **Confirmation of Type:** The high uniqueness ratios for `Title` (**87.15%**) and `Headline` (**92.98%**) definitively confirm that these are **unstructured text fields**, not hidden categorical variables. They cannot be grouped or analyzed via frequency counts in their current form.

*   **Technical Readiness:** The data is remarkably complete. `Title` has **zero missing values**, and `Headline` has a negligible **0.02%** missing rate (15 rows). This indicates the dataset is technically sound and ready for advanced processing without heavy imputation.

**2. Qualitative Content Inspection (The Engineering Opportunity)**

A visual inspection of the random sample reveals high-density semantic information that justifies our "Content DNA" strategy:

*   **Varying Tone:** The sample ranges from casual/cultural references (*"Obama, Nerd President"*) to formal economic reporting (*"Johannesburg committed to developing..."*).

*   **Emotional Signals:** There is clear evidence of sentiment-heavy language. Words like *"ignorance"* (Row 78589) suggest negative sentiment, while *"elevating hopes"* (Row 66136) suggests positive sentiment.

*   **Structural Variety:** We see standard declarative sentences alongside short, interrogative titles (*"Whose President Was He?"* - Row 22941), which hints at "clickbait" or curiosity-gap styling.

==> **Strategic Implication:** In their current raw state (strings), these columns are **analytically silent**. To answer our **Chapter 1** questions, standard cleaning is insufficient. We **must apply Natural Language Processing (NLP)** in the next notebook to mathematically extract **Sentiment Scores** and **Cognitive Complexity**, converting these qualitative nuances into quantitative metrics.
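
Before any NLP library is involved, the structural variety noted above can be checked with plain string operations. The sketch below is a first-pass illustration only, not sentiment analysis: it derives a word count and an "is a question" flag for three titles taken from the random sample. The feature names `word_count` and `is_question` are hypothetical.

```python
import pandas as pd

# Three titles from the sampled output above
titles = pd.Series([
    "Whose President Was He?",
    "Obama: Trump displays ignorance, seeks tweets over solutions",
    "Argentina pays off 'holdout' bondholders, elevating hopes for economy",
])

# Simple surface features computed with pandas string methods
features = pd.DataFrame({
    "word_count": titles.str.split().str.len(),            # whitespace-separated tokens
    "is_question": titles.str.strip().str.endswith("?"),   # interrogative styling
})

print(features)
```

Even these crude signals vary across the sample, which supports the claim that the text carries extractable structure.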

# **Section 3: Diagnosing the Social Feedback Files**

**General Information**

*   **Objective:** Diagnose the systemic issues related to **data fragmentation** and the **unsuitable data structure** of the 12 social feedback files.

*   **Methodology:** We load a single representative file to prove the systemic issues. This avoids the computational cost of loading all 12 files before we have a strategy to handle them.

*   **Inspection Points:**
    
    1.  **Fragmentation:** The existence of multiple, siloed CSV files prevents holistic analysis.
    
    2.  **Wide Format Structure:** Each article's time series is spread across 144 columns (`TS1` to `TS144`), which hinders time-based calculations such as velocity and acceleration.

*   **Goal:** Formally justify the need for **Data Integration** and **Structural Reshaping (`melt`)** to unlock advanced feature engineering.
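
As a preview of what the reshaping goal implies, here is a minimal sketch of `DataFrame.melt` on a toy wide frame with three time slices instead of 144. The output column names `TimeSlice` and `Popularity` are hypothetical, and the values are illustrative; the actual preparation notebook may structure this differently.

```python
import pandas as pd

# Toy wide-format frame: one row per article, one column per time slice
wide = pd.DataFrame({
    "IDLink": [1, 2],
    "TS1": [-1, 0],
    "TS2": [-1, 3],
    "TS3": [2, 7],
})

# Reshape to long format: one row per (article, time slice) pair
long = wide.melt(id_vars="IDLink", var_name="TimeSlice", value_name="Popularity")

# Turn 'TS3' into the integer 3 so slices sort numerically rather than as text
long["TimeSlice"] = long["TimeSlice"].str[2:].astype(int)
long = long.sort_values(["IDLink", "TimeSlice"]).reset_index(drop=True)

print(long)
```

In long format, per-article differences between consecutive slices become a grouped `diff()` rather than arithmetic across 144 separate columns.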


In [14]:
# --- Section 3: Diagnosing the Social Feedback Files ---

if not df_meta.empty:
    # Use the relative path ('Data' folder in the same directory as notebook)
    data_path = "Data/"

    # Use glob to find all feedback files, excluding the main metadata file
    # (sorted so the representative sample file is chosen deterministically)
    all_csv_files = sorted(glob.glob(os.path.join(data_path, "*.csv")))
    feedback_files = [f for f in all_csv_files if os.path.basename(f) != "News_Final.csv"]

    # 1. --- Prove the Fragmentation Problem ---
    print("--- Data Fragmentation Diagnosis ---")
    print(f"Found {len(feedback_files)} separate social feedback files:")
    # Print the first 5 for brevity
    for file in feedback_files[:5]:
        print(f"  - {os.path.basename(file)}")
    if len(feedback_files) > 5:
        print("  - ... and more.")

    # 2. --- Prove the Wide Format Problem ---
    print("\n--- Wide Format Structural Diagnosis ---")
    if feedback_files:
        # Load just one file as a representative sample
        sample_file_path = feedback_files[0]
        try:
            df_sample_feedback = pd.read_csv(sample_file_path)

            print(f"Analyzing sample file: '{os.path.basename(sample_file_path)}'")
            print(f"Sample dimensions: {df_sample_feedback.shape[0]} rows and {df_sample_feedback.shape[1]} columns.")

            print("\nFirst 5 columns of the sample file:")
            display(df_sample_feedback.iloc[:, :5].head())

            # Explicitly show the TS columns
            ts_columns_sample = [col for col in df_sample_feedback.columns if col.startswith('TS')]
            print(f"\nFound {len(ts_columns_sample)} columns representing time slices (e.g., '{ts_columns_sample[0]}', '{ts_columns_sample[1]}', ...).")

        except FileNotFoundError:
            print(f"ERROR: Sample feedback file not found at path: {sample_file_path}")
    else:
        print("No feedback files found to analyze.")

--- Data Fragmentation Diagnosis ---
Found 12 separate social feedback files:
  - Facebook_Economy.csv
  - Facebook_Microsoft.csv
  - Facebook_Obama.csv
  - Facebook_Palestine.csv
  - GooglePlus_Economy.csv
  - ... and more.

--- Wide Format Structural Diagnosis ---
Analyzing sample file: 'Facebook_Economy.csv'
Sample dimensions: 29928 rows and 145 columns.

First 5 columns of the sample file:


Unnamed: 0,IDLink,TS1,TS2,TS3,TS4
0,1,-1,-1,-1,-1
1,2,-1,-1,-1,-1
2,3,-1,-1,-1,-1
3,4,-1,-1,-1,-1
4,5,-1,-1,-1,-1



Found 144 columns representing time slices (e.g., 'TS1', 'TS2', ...).


**Output Analysis**

*   **Data Fragmentation Confirmed:**
    
    *   The script identified **12 separate CSV files**, siloed by `Platform` and `Topic`.
    
    => **Strategic Barrier:** We cannot calculate a "Global Opportunity Score" or compare "Economy vs. Microsoft" velocity because the data is physically separated. **Integration is mandatory.**

*   **"Wide Format" Problem Proven:**
    
    *   The file contains **144 separate columns** (`TS1`...`TS144`) for time.
    
    => **Velocity Barrier:** To calculate **Initial Velocity** (growth in the first 20 mins), we would need to compute `TS2 - TS1` for every row. To calculate **Peak Acceleration**, we would need to take successive differences across all 144 columns. Hard-coding this column arithmetic is cumbersome and rigid.
    
    => **Fix:** We must **reshape (melt)** the data into a "Long Format" where `TimeSlice` is a single variable. This allows us to use vectorized operations to calculate speed, stickiness, and acceleration efficiently.
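The reshape described above can be sketched on a toy frame. This is a minimal illustration, not the implementation used in the next notebook: the values are invented, the column name `Value` follows the conceptual diagram, and treating `-1` as missing before differencing is an assumption based on the placeholder issue noted in this notebook.

```python
import numpy as np
import pandas as pd

# Toy wide-format frame mimicking a feedback file (values are invented).
wide = pd.DataFrame({
    "IDLink": [1, 2],
    "TS1": [-1, 0],
    "TS2": [3, 2],
    "TS3": [10, 2],
})

# Reshape: one row per (article, time slice).
long = wide.melt(id_vars="IDLink", var_name="TimeSlice", value_name="Value")

# Turn 'TS7' into the integer 7 so slices sort numerically, not lexically.
long["TimeSlice"] = long["TimeSlice"].str.replace("TS", "", regex=False).astype(int)
long = long.sort_values(["IDLink", "TimeSlice"]).reset_index(drop=True)

# Assumption: -1 is a placeholder, so mask it before taking differences.
long["Value"] = long["Value"].replace(-1, np.nan)

# Velocity is the first difference within each article; acceleration is the second.
long["Velocity"] = long.groupby("IDLink")["Value"].diff()
long["Acceleration"] = long.groupby("IDLink")["Velocity"].diff()

print(long)
```

Once `TimeSlice` is a single column, the same two `groupby(...).diff()` calls cover any number of time slices without naming individual `TS` columns.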

In [15]:
# --- Section 3: Conceptual Visualization of the Structural Problem ---

fig = go.Figure()

# 1. Boxes representing the data states
shapes = [
    # Box for Fragmented Files (Problem 1)
    go.layout.Shape(type="rect", xref="x", yref="y", x0=0.5, y0=1.5, x1=2.5, y1=2.5, line=dict(color="#d62728", width=2)),
    go.layout.Shape(type="rect", xref="x", yref="y", x0=0.5, y0=0.5, x1=2.5, y1=1.4, line=dict(color="#d62728", width=2)),
    go.layout.Shape(type="rect", xref="x", yref="y", x0=0.5, y0=-0.5, x1=2.5, y1=0.4, line=dict(color="#d62728", width=2)),

    # Box for Integrated "Wide" Data (Problem 2)
    go.layout.Shape(type="rect", xref="x", yref="y", x0=4, y0=0, x1=7, y1=2, line=dict(color="#ff7f0e", width=2)),

    # Box for Final "Long" Data (Required State)
    go.layout.Shape(type="rect", xref="x", yref="y", x0=8.5, y0=0, x1=11, y1=2, line=dict(color="#1f77b4", width=2))
]

# 2. Text annotations for each part of the diagram
annotations = [
    # --- Stage 1: Fragmentation ---
    go.layout.Annotation(text="<b>Problem 1: Fragmentation</b>", x=1.5, y=3.5, showarrow=False, font=dict(color="#d62728", size=16)),
    go.layout.Annotation(text="<b>Facebook_Economy.csv</b>", x=1.5, y=2, showarrow=False, font_size=16),
    go.layout.Annotation(text="<b>Facebook_Microsoft.csv</b>", x=1.5, y=1, showarrow=False, font_size=16),
    go.layout.Annotation(text="<b>... (12 files total)</b>", x=1.5, y=0, showarrow=False, font_size=16),

    # --- Stage 2: Wide Format ---
    go.layout.Annotation(text="<b>Problem 2: 'Wide' Format</b>", x=5.5, y=3.5, showarrow=False, font=dict(color="#ff7f0e", size=16)),
    go.layout.Annotation(text="<b>ID | TS1 | TS2 | ... | TS144</b>", x=5.5, y=1, showarrow=False, font=dict(family="Courier New", size=16)),

    # --- Stage 3: Long Format (Goal) ---
    go.layout.Annotation(text="<b>Required: 'Long' Format</b>", x=9.75, y=3.5, showarrow=False, font=dict(color="#1f77b4", size=16)),
    go.layout.Annotation(text="<b>ID | TimeSlice | Value</b>", x=9.75, y=1, showarrow=False, font=dict(family="Courier New", size=16)),

    # --- Process Labels (on top of arrows) ---
    go.layout.Annotation(text="Integrate", x=3.25, y=1.3, showarrow=False, font=dict(size=14)),
    go.layout.Annotation(text="Reshape (Melt)", x=7.75, y=1.3, showarrow=False, font=dict(size=14)),

    # --- Process Arrows (perfectly horizontal) ---
    go.layout.Annotation(text="", showarrow=True, arrowhead=2, arrowwidth=2, ax=2.6, ay=1, x=3.9, y=1, axref='x', ayref='y'),
    go.layout.Annotation(text="", showarrow=True, arrowhead=2, arrowwidth=2, ax=7.1, ay=1, x=8.4, y=1, axref='x', ayref='y'),
]

# --- Update layout to create the diagram ---
fig.update_layout(
    title_text="<b>Conceptual Diagram of Structural Data Problems</b>",
    title_font=dict(size=24),
    shapes=shapes,
    annotations=annotations,
    xaxis=dict(showgrid=False, zeroline=False, visible=False, range=[0, 11.5]),
    yaxis=dict(showgrid=False, zeroline=False, visible=False, range=[-1, 4.5]),
    height=400
)

fig.show()

# **Section 4: Summary of Findings & Handoff**

## **4.1. Case for Data Preparation & Engineering**

The raw dataset is not only **fragmented and structurally unsuitable** but also **analytically shallow** in its current form. To answer our strategic business questions (Content DNA, Viral Dynamics, Market Ecology), we must go beyond simple cleaning. This summary serves as the formal mandate for an important later phase: **Advanced Feature Engineering**.

## **4.2. Required Preparation & Engineering Actions**

**I. Structural & Integration Issues**
1.  **Data Fragmentation:**
    *   **Problem:** Time-series data is scattered across 12 separate CSV files.
    *   **Action:** Integrate all feedback files and metadata into a single master DataFrame to enable **cross-platform benchmarking**.

2.  **Unsuitable "Wide" Format:**
    *   **Problem:** Time-series data uses 144 columns (`TS1`...`TS144`), preventing dynamic analysis.
    *   **Action:** Perform a **structural transformation (`melt`)** to reshape data into "Long Format."
    *   **Strategic Goal:** This is required to calculate **Vector Metrics** (Initial Velocity, Acceleration, and Stickiness).
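As a rough sketch of the integration step, the files can be tagged with their platform and topic and then concatenated. The example assumes the `<Platform>_<Topic>.csv` naming pattern seen in the file listing, and uses small in-memory frames with invented values in place of the real files. The name `df_feedback_all` is hypothetical.

```python
import os

import pandas as pd

# Stand-ins for files on disk: filename -> wide DataFrame (values are invented).
frames_by_file = {
    "Facebook_Economy.csv": pd.DataFrame({"IDLink": [1, 2], "TS1": [-1, 0], "TS2": [4, 1]}),
    "Facebook_Microsoft.csv": pd.DataFrame({"IDLink": [3], "TS1": [2], "TS2": [5]}),
}

tagged = []
for filename, df in frames_by_file.items():
    # Assumed naming convention: '<Platform>_<Topic>.csv'.
    stem = os.path.splitext(filename)[0]
    platform, topic = stem.split("_", 1)
    # Record the origin of each row before the files are merged.
    tagged.append(df.assign(Platform=platform, Topic=topic))

# Stack all files into one frame with a fresh row index.
df_feedback_all = pd.concat(tagged, ignore_index=True)

print(df_feedback_all)
```

With the real data, each dictionary value would instead come from `pd.read_csv(path)` for every path in `feedback_files`.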

**II. Data Integrity & Type Issues**
1.  **Incorrect `PublishDate` Data Type:**
    *   **Problem:** The `PublishDate` is a string (`object`).
    *   **Action:** Convert to `datetime` to engineer **temporal features** (Hour, Day) and calculate **Market Saturation** (rolling window counts).

2.  **Placeholder Values:**
    *   **Problem:** Popularity columns use **-1** as a placeholder.
    *   **Action:** Replace with **`np.nan`** to prevent statistical corruption.
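Both actions can be illustrated on a toy frame. The sketch below uses invented values. The popularity column name `Facebook`, the date format, and the derived column names `PublishHour` and `PublishDay` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy metadata frame (values are invented).
df = pd.DataFrame({
    "PublishDate": ["2015-11-08 09:15:00", "2015-11-08 21:40:00", "not a date"],
    "Facebook": [-1, 10, 30],
})

# errors="coerce" turns unparseable strings into NaT instead of raising.
df["PublishDate"] = pd.to_datetime(df["PublishDate"], errors="coerce")
df["PublishHour"] = df["PublishDate"].dt.hour
df["PublishDay"] = df["PublishDate"].dt.day_name()

# The -1 placeholder drags the mean down if it is treated as a real count.
naive_mean = df["Facebook"].mean()
df["Facebook"] = df["Facebook"].replace(-1, np.nan)
clean_mean = df["Facebook"].mean()  # NaN is skipped by default

print(f"Mean with placeholder: {naive_mean}, mean after masking: {clean_mean}")
```

The gap between the two means shows how an unmasked placeholder can distort summary statistics.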

**III. Data Quality & Distribution Issues**
1.  **High-Cardinality `Source` Column:**
    *   **Problem:** Over 5,700 unique sources create noise.
    *   **Action:** **Consolidate** into categories and engineer a **`Source_Tier`** feature (Tier 1 Mainstream vs. Tier 3 Niche) for "David vs. Goliath" analysis.

2.  **Extreme Skewness:**
    *   **Problem:** Outliers dominate the distribution.
    *   **Action:** Apply **logarithmic transformation** for visualization and anomaly detection.
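A minimal sketch of both actions follows. The tier thresholds are hypothetical and chosen only to make the toy example work, since this notebook does not define them. The values and the `Facebook` column name are likewise invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one dominant source, one mid-sized, one niche (values are invented).
df = pd.DataFrame({
    "Source": ["Reuters", "Reuters", "Reuters", "Bloomberg", "Bloomberg", "Tiny Blog"],
    "Facebook": [0, 12, 50000, 3, 140, 7],
})

# Number of articles published by each row's source.
counts = df["Source"].map(df["Source"].value_counts())

# Hypothetical tiering rule based on article volume per source.
df["Source_Tier"] = np.select(
    [counts >= 3, counts >= 2],
    ["Tier 1", "Tier 2"],
    default="Tier 3",
)

# log1p(x) = log(1 + x) stays defined at zero and compresses the long tail.
df["Facebook_log"] = np.log1p(df["Facebook"])

print(df)
```

`log1p` is used instead of a plain logarithm because popularity counts can be zero. Placeholder values such as `-1` would need to be masked first.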

**IV. Strategic Feature Engineering Requirements**
*   **Problem:** Text columns (`Headline`, `Title`) are unstructured strings; we cannot measure "tone" or "clickbait."

*   **Action:** Apply **Natural Language Processing (NLP)** to extract:
    *   **Sentiment Scores:** (Positive/Negative).
    *   **Subjectivity:** (Fact vs. Opinion).
    *   **Cognitive Complexity:** (Readability/Word count).

*   **Problem:** Articles are treated as isolated events.

*   **Action:** Engineer an **`Opportunity_Score`** to measure competition density at the time of publication.
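One possible reading of competition density is the number of articles published in a trailing time window. The sketch below counts articles in the preceding 60 minutes and inverts that count. The window length, the `Competition_1h` column name, and the `1 / count` formula are all hypothetical choices, not definitions taken from this notebook.

```python
import pandas as pd

# Toy frame with already-parsed publication times (values are invented).
df = pd.DataFrame({
    "IDLink": [1, 2, 3, 4],
    "PublishDate": pd.to_datetime([
        "2015-11-08 09:00", "2015-11-08 09:20", "2015-11-08 09:50", "2015-11-08 13:00",
    ]),
}).sort_values("PublishDate").reset_index(drop=True)

# Time-based rolling windows need a sorted DatetimeIndex.
# Each window covers the trailing 60 minutes and includes the article itself.
competition = df.set_index("PublishDate")["IDLink"].rolling("60min").count()
df["Competition_1h"] = competition.to_numpy()

# Hypothetical score: fewer competing articles means more opportunity.
df["Opportunity_Score"] = 1.0 / df["Competition_1h"]

print(df)
```

A per-topic version would apply the same rolling count within each topic group instead of across the whole frame.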