Analysis of PixStory social media data combined with Snapchat, COVID-19, and YouTube data. This project uses the Apache Tika Clustering software to cluster certain social media posts together.

todd-gavin/DSCI550-PixstoryDataAnalysis

DSCI550-PixstoryDataAnalysis

Collaborators: Jai Agrawal, Daniil Abbruzzese, Todd Gavin, Tania Dawood

1. Project Title:

Effects of the Pandemic on Media Usage and Consumption

2. Project Description:

We have compiled a large dataset containing data provided by Pixstory, along with additional data sources of various MIME types, to analyze how the COVID pandemic has affected people's use of social media.

3. Implementation

Each additional dataset has a detailed notebook describing its use; running each notebook file should produce the required outputs. The dataset directories follow the format Dataset [num]_[title]. The additional columns that needed to be added are also documented in detailed notebooks within their own directories, which follow the format [letter]_[col name]. The ‘1_Apache_Tika_Analysis’ directory contains detailed instructions on how to run the Tika similarity analysis as we did. The “Report_Questions” directory has separate .py files that were used to make the visualizations in the report that illustrate certain findings.

4. Default Features Added

Sports Datasets:

Film Festivals:

Hate Speech Dataset:

Sarcasm Dataset:

5. Additional Datasets Added

Additional Dataset #1 - Snapchat Daily Average Users Data:

MIME Type: text/csv

Usage: To compare Snapchat's DAUs with the number of daily posts on Pixstory in order to explore our research question. This dataset includes the date, Snapchat's Daily Average Users and Revenue, and Snapchat stock data (Open, High, Low, Close, Adjusted Close, and Volume).

To create this dataset, we obtained Snapchat’s quarterly revenue and daily average users from Statista and its daily stock value from Yahoo Finance. Our data ranges from 2020 to 2022, as that is the range of data we have available for the Pixstory dataset. For the Snapchat dataset, the features we will be using for analysis are:

  • Feature 1: Daily Average Users
  • Feature 2: Stock Price
  • Feature 3: Revenue

Source:

Additional Dataset #2 - Daily COVID Data:

MIME Type: application/json

Usage: To keep track of the number of daily COVID cases, deaths, and vaccinations and see how these correlate with the number of Pixstory posts, Snapchat DAUs, and the number of likes and views on the daily trending YouTube videos. This dataset includes the date, number of deaths, number of cases, and number of vaccinations.

  • Feature 1: New daily deaths due to COVID in India
  • Feature 2: New daily COVID cases in India
  • Feature 3: New vaccinations against COVID in India

Note: this dataset only had vaccination data available from 1/15/2021 onward, on which date it listed India as having 0 vaccinations against COVID. Because the Pixstory dataset starts 1/12/2020, we decided to add the missing dates and input 0s for all vaccinations on those dates. This is justified because, according to this dataset, India had no vaccinations until 1/16/2021.
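The backfill described in the note above can be sketched with pandas. This is a minimal illustration and not the project's actual notebook code: the function name `backfill_vaccinations` and the column names `date` and `new_vaccinations` are assumptions made for the example.

```python
import pandas as pd


def backfill_vaccinations(covid, start_date):
    """Extend a daily COVID table back to start_date and zero-fill vaccinations.

    The column names "date" and "new_vaccinations" are hypothetical; adjust
    them to match the real dataset.
    """
    covid = covid.copy()
    covid["date"] = pd.to_datetime(covid["date"])
    covid = covid.set_index("date").sort_index()

    # Build one row per calendar day from start_date to the last known date.
    full_range = pd.date_range(start=start_date, end=covid.index.max(), freq="D")
    covid = covid.reindex(full_range)

    # Only the vaccination column is zero-filled; other columns keep their gaps.
    covid["new_vaccinations"] = covid["new_vaccinations"].fillna(0)
    return covid.rename_axis("date").reset_index()
```

Only the vaccination column is filled here, so any gaps in cases or deaths stay visible as missing values rather than being silently turned into zeros.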

Source:

Additional Dataset #3 - YouTube Daily Trending Videos:

MIME Type: video/mp4

Usage: As with the Snapchat DAU dataset, we wanted to see whether the number of likes and views correlates with Pixstory posts and COVID cases. This dataset includes data on video ID, title, published at, channel ID, channel title, category ID, trending date, tags, view count, likes, dislikes, and comment count.

This is a dataset of the top trending videos on YouTube on any particular day. The data ranges from 2020 to 2022, and its features include:

  • Feature #1: highest trending video name
  • Feature #2: highest trending video channel
  • Feature #3: highest trending video category
  • Feature #4: highest trending video views
  • Feature #5: highest trending video likes
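One way to derive the five features above is to keep a single video per trending date. The sketch below assumes "highest trending" means the video with the largest view count on that date, which the README does not state. The function name `top_video_per_day` and the column names are likewise hypothetical.

```python
import pandas as pd


def top_video_per_day(videos):
    """Keep the row with the highest view count for each trending date.

    Assumes a unique index and hypothetical column names; "highest trending"
    is interpreted as "most viewed", which is an assumption.
    """
    # idxmax returns, per date, the index label of the most-viewed row.
    top_index = videos.groupby("trending_date")["view_count"].idxmax()
    columns = ["trending_date", "title", "channel_title",
               "category_id", "view_count", "likes"]
    top_rows = videos.loc[top_index, columns]
    return top_rows.sort_values("trending_date").reset_index(drop=True)
```

If a different ranking is intended, for example the first-listed video of the day, only the `idxmax` line would need to change.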

Source:

Note: all datasets were combined using the date column.
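Combining on the date column can be sketched as a chain of pandas merges. This is an illustrative sketch, not the project's actual code: the function name `combine_on_date`, the shared column name `date`, and the default outer join are all assumptions, since the README does not say which join type was used.

```python
from functools import reduce

import pandas as pd


def combine_on_date(frames, how="outer"):
    """Merge several daily tables on a shared "date" column.

    The column name and the default join type are assumptions for illustration.
    """
    prepared = []
    for frame in frames:
        frame = frame.copy()
        # Normalize the key so string and datetime dates compare equal.
        frame["date"] = pd.to_datetime(frame["date"])
        prepared.append(frame)

    # Fold the list into a single table, one merge at a time.
    merged = reduce(
        lambda left, right: left.merge(right, on="date", how=how),
        prepared,
    )
    return merged.sort_values("date").reset_index(drop=True)
```

An outer join keeps every date that appears in any source, leaving missing values where a source has no row. Passing `how="inner"` would instead keep only dates present in all sources.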
