# **A0. Project Kick-off: Dataset Overview and Analytical Framing**

## **A0.1. Introduction to the Dataset**

Our project will analyze the *"News Popularity in Multiple Social Media Platforms"* dataset. This dataset is well suited to our objective, demonstrating the power of data preparation, because its raw form presents significant, realistic barriers to analysis.

The data is provided in **13 separate files**, which can be categorized into two distinct types:

*   **Metadata File (`News_Final.csv`):** This is the central anchor of our dataset. It contains descriptive information for each of the ~93,000 news articles, including:
    *   `IDLink`: A unique identifier for each article.
    *   `PublishDate`: The timestamp of publication, crucial for any time-based analysis.
    *   `Topic`: The article's category (e.g., `economy`, `microsoft`).
    *   `Facebook`, `GooglePlus`, `LinkedIn`: The final popularity scores after a 48-hour period.

*   **Social Feedback Files (12 CSVs):** These files contain the granular, time-series popularity data.
    *   Their naming pattern (`Platform_Topic.csv`) splits the data by both social media platform and news topic: 3 platforms × 4 topics = 12 files.
    *   Their most defining characteristic is their **"wide" data format**, where columns `TS1` through `TS144` represent the popularity score at sequential 20-minute intervals (144 × 20 minutes = 48 hours).
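
As a quick sanity check on the file layout described above, the sketch below enumerates the expected feedback file names and the wide-format column names. It assumes the four topics named in this document are the complete set and that file names capitalise the topic (e.g. `Facebook_Economy.csv`); `feedback_filenames` and `ts_columns` are hypothetical helpers, not part of the dataset.

```python
from itertools import product

# Platform and topic labels as they appear in this document. The exact
# capitalisation used in the real file names is an assumption.
PLATFORMS = ["Facebook", "GooglePlus", "LinkedIn"]
TOPICS = ["Economy", "Microsoft", "Obama", "Palestine"]


def feedback_filenames(platforms=PLATFORMS, topics=TOPICS):
    """Return every expected `Platform_Topic.csv` file name."""
    return [f"{p}_{t}.csv" for p, t in product(platforms, topics)]


def ts_columns(n_slices=144):
    """Return the wide-format time-slice column names TS1..TS<n_slices>."""
    return [f"TS{i}" for i in range(1, n_slices + 1)]


files = feedback_filenames()
hours_covered = len(ts_columns()) * 20 / 60  # 20-minute slices -> hours

print(len(files), "feedback files, e.g.", files[0])
print("hours covered by the TS columns:", hours_covered)
```

Together with `News_Final.csv`, these 12 names account for the 13 files in the dataset.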

## **A0.2. The Core Challenge: Data in an Unusable State**

In its raw state, this dataset is **analytically hostile**. It is impossible to draw meaningful, multi-dimensional conclusions without a *significant and deliberate preparation* phase. Our project will be the story of overcoming **three fundamental challenges**:

*   **Challenge 1: Data Fragmentation.** The insights are scattered across 13 separate files. It's impossible to compare platform performance or topic trends without first integrating this data into a single, unified structure.

*   **Challenge 2: Structural Unsuitability.** The "wide" format of the feedback files, with 144 columns for time, is compact to store but awkward to analyze. Plotting a trend over time or comparing engagement lifecycles requires one row per observation, not one column per time slice.

*   **Challenge 3: Data Integrity Issues.** Several critical features are in an unusable format. The `PublishDate` is stored as text, so it cannot be grouped or decomposed by hour or weekday until it is parsed. Furthermore, popularity scores use `-1` as a placeholder for missing data, which would severely distort any statistical calculation (e.g., the average) if treated as a real value.

⇒ The central premise of our project is that this raw data, while rich in potential, **actively prevents insight**. Our data preparation work is the key that will unlock this potential.
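
To make Challenge 3 concrete, here is a minimal sketch on invented toy data (not rows from the real `News_Final.csv`). It shows how the `-1` placeholder pulls an average down, and how a text `PublishDate` can be parsed. The timestamp format in the toy frame is an assumption.

```python
import pandas as pd

# Toy stand-in for a few rows of News_Final.csv; all values are invented.
raw = pd.DataFrame({
    "IDLink": [1, 2, 3, 4],
    "PublishDate": ["2015-11-01 08:30:00", "2015-11-01 21:15:00",
                    "2015-11-02 12:00:00", "2015-11-03 09:45:00"],
    "Facebook": [10, -1, 30, -1],
})

# Naive average: the -1 placeholders are counted as real scores.
naive_mean = raw["Facebook"].mean()

clean = raw.copy()
# Turn the -1 placeholder into a proper missing value (NaN).
clean["Facebook"] = clean["Facebook"].mask(clean["Facebook"] == -1)
# Parse the text timestamps; unparseable strings become NaT instead of raising.
clean["PublishDate"] = pd.to_datetime(clean["PublishDate"], errors="coerce")

# pandas skips NaN by default, so only the two real scores are averaged.
clean_mean = clean["Facebook"].mean()

print("mean with -1 kept:", naive_mean)
print("mean with -1 as missing:", clean_mean)
print("PublishDate dtype:", clean["PublishDate"].dtype)
```

In this toy case the naive mean is less than half of the mean computed over the real scores only, which is the kind of distortion described above.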

## **A0.3. Guiding Analytical Questions**

To guide our exploration and structure our data story, we will focus on answering a set of high-level strategic questions. These questions have been specifically chosen because they are **impossible to answer with the raw data** and can only be addressed after a comprehensive preparation pipeline.

1.  **The Temporal Question:** How do the **time of day** and **day of the week** an article is published influence its **engagement trajectory** on different social media platforms?

2.  **The Platform Personality Question:** Do different social media platforms exhibit **unique engagement "personalities"**? For instance, does LinkedIn engagement peak during business hours while Facebook engagement peaks in the evenings and on weekends?

3.  **The Content Lifecycle Question:** Is there a discernible difference in the popularity lifecycle between **"hard news"** topics (e.g., `economy`, `palestine`) and **"corporate/political"** topics (e.g., `microsoft`, `obama`)? Do some topics burn brightly but fade quickly, while others have a longer "tail" of engagement?

4.  **The Engagement Velocity Question:** Which platform provides the fastest initial **"lift"** or engagement velocity within the first few hours of an article's publication?

## **A0.4. Why These Questions? The Link to Data Preparation**

These questions are not arbitrary; they directly map to the **specific technical challenges** we must overcome. Each preparation task is a necessary step that **enables a specific type of analysis**, making our guiding questions answerable.

*   **To investigate the impact of time (Question 1):**
    *   **We must** convert the `PublishDate` column from text (pandas `object` dtype) into a machine-readable **`datetime` format**.
    *   **We must** then **engineer new features** from this `datetime` object, such as `hour_of_day` and `day_of_week`.
    *   ⇒ **This enables** us to group and aggregate our data by time intervals, a capability that is completely absent in the raw data.

*   **To compare platform performance (Question 2):**
    *   **We must** **integrate the 13 fragmented files** into a single, cohesive master dataset.
    *   During this process, we will create a new `Platform` column to label the source of each popularity record.
    *   ⇒ **This enables** direct, side-by-side analysis of platform behavior, which is impossible when the data is siloed.

*   **To analyze engagement lifecycles and velocity (Questions 3 & 4):**
    *   **We must** perform a **structural wide-to-long transformation** on the time-series data (using `pd.melt`).
    *   This reshapes the 144 `TS` columns into rows: a single `TimeSlice` variable identifies the interval, and one value column holds the popularity score.
    *   ⇒ **This enables** us to plot popularity as a function of time, allowing us to visualize and measure the "lifecycle" and "velocity" of engagement. This is the most critical transformation in our project.
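
The three preparation tasks above can be sketched end to end on toy data. This is an illustrative outline, not the final pipeline: the frames hold invented values with three time slices instead of 144, the feedback files are assumed to share the `IDLink` key with the metadata, and the `Popularity` column name and `to_long` helper are hypothetical.

```python
import pandas as pd

# Toy wide-format frames standing in for two feedback files (invented values).
facebook_wide = pd.DataFrame(
    {"IDLink": [1, 2], "TS1": [0, 1], "TS2": [3, 2], "TS3": [5, 2]}
)
linkedin_wide = pd.DataFrame(
    {"IDLink": [1, 2], "TS1": [0, 0], "TS2": [1, 0], "TS3": [1, 4]}
)

# Toy metadata standing in for News_Final.csv, with PublishDate already parsed.
meta = pd.DataFrame({
    "IDLink": [1, 2],
    "Topic": ["economy", "microsoft"],
    "PublishDate": pd.to_datetime(["2015-11-02 08:30:00", "2015-11-07 21:15:00"]),
})


def to_long(wide, platform):
    """Melt one wide feedback frame to long format and label its platform."""
    ts_cols = [c for c in wide.columns if c.startswith("TS")]
    long = pd.melt(wide, id_vars=["IDLink"], value_vars=ts_cols,
                   var_name="TimeSlice", value_name="Popularity")
    # "TS17" -> 17, so the slice can be sorted and plotted numerically.
    long["TimeSlice"] = long["TimeSlice"].str[2:].astype(int)
    long["Platform"] = platform
    return long


# Integration: stack the per-platform long frames into one table.
long_all = pd.concat(
    [to_long(facebook_wide, "Facebook"), to_long(linkedin_wide, "LinkedIn")],
    ignore_index=True,
)

# Attach article metadata, then engineer the time features.
master = long_all.merge(meta, on="IDLink", how="left")
master["hour_of_day"] = master["PublishDate"].dt.hour
master["day_of_week"] = master["PublishDate"].dt.day_name()

# Mean popularity per platform and time slice: the basis of a lifecycle plot.
lifecycle = master.groupby(["Platform", "TimeSlice"])["Popularity"].mean()
print(master.shape)
print(lifecycle)
```

Each row of `master` is now one (article, platform, time slice) observation, so the guiding questions reduce to ordinary `groupby` operations over `Platform`, `Topic`, `TimeSlice`, `hour_of_day`, and `day_of_week`.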

⇒ Our data preparation is therefore a **targeted strategy**. The quality of our final story is directly dependent on how well we execute these transformations to make answering our guiding questions **possible**.