# A4 Final Project

## Motivation and problem statement

I plan to analyze Twitch data and Google Trends data to understand how users discover content and remain on a platform. I am interested in examining a synchronous platform such as Twitch and how individuals become engaged, specifically by continuing to view or returning to a streamer. From a human-centered perspective, it will be interesting to learn how our patterns vary in searches and watch time, if determinable, though this may be difficult. From a scientific perspective, it may be useful to compare effort (in stream time) with results (in watch time) and gather insights from the comparison. I hope to learn more about how we discover content on synchronous platforms (through exact versus general search terms), in this case a specific streamer's channel, and how engaged we stay on these platforms.

Originally I planned to examine only one dataset, the Top Streamers on Twitch, but found that my goals and end result would be rather limited. My idea was not well scoped, as I had simply considered a potential dataset to explore. In speaking with a peer, I was pushed to consider the reasoning behind this data and how I might use it to represent something, such as the significance of viewership or watch time, rather than simply comparing the Twitch data to itself. I then decided to pair it with a second dataset, which became the Google Trends data.

## Data selected for analysis

With the [Top Streamers on Twitch](https://www.kaggle.com/datasets/aayushmishra1512/twitchdata) and [Google Trends](https://trends.google.com/trends/?geo=US) data, I believe I have enough data to relate search patterns to watch time and viewer count and to discern possible patterns. While many external factors affect viewership, I believe there may be some correlation between search patterns and how users discover, rediscover, or simply navigate to certain channels.

[Top Streamers on Twitch](https://www.kaggle.com/datasets/aayushmishra1512/twitchdata) contains, for each Twitch channel, the channel name, watch time, stream time, peak viewers, average viewers, followers, followers gained, views gained, whether the channel is partnered with Twitch, and whether the channel is marked as mature content (18+). This data is scraped and covers one year. This dataset is under the CC0: Public Domain license. The data can be combined with search queries to try to identify patterns or correlations with values such as watch time, follower gain, or viewership. Usage of this dataset presents no ethical concerns in the context of this project, but the data may be limited or skewed by other factors such as viewbotting.
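As a rough sketch of how this file might be read, the snippet below parses CSV text with Python's standard `csv` module and derives watch minutes per streamed minute for each channel. The header names (`Channel`, `Watch time(Minutes)`, `Stream time(minutes)`) are assumptions that should be checked against the actual download, and the sample rows are made-up placeholders rather than real streamer figures.

```python
import csv
import io

# Placeholder rows for illustration only; these are NOT real streamer figures.
# The header names are assumed and should be checked against the Kaggle file.
SAMPLE_CSV = """Channel,Watch time(Minutes),Stream time(minutes)
channel_a,1000,100
channel_b,600,200
channel_c,50,0
"""


def watch_per_stream_minute(csv_text):
    """Return {channel: watch minutes per streamed minute}.

    Channels with zero stream time are skipped to avoid dividing by zero.
    """
    ratios = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        stream_minutes = float(row["Stream time(minutes)"])
        if stream_minutes == 0:
            continue
        ratios[row["Channel"]] = float(row["Watch time(Minutes)"]) / stream_minutes
    return ratios


ratios = watch_per_stream_minute(SAMPLE_CSV)
```

With the real file, the text passed in would come from reading the downloaded CSV instead of `SAMPLE_CSV`.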

[Google Trends](https://trends.google.com/trends/?geo=US) lets one look up a search term, such as Twitch or a channel name, and returns information such as interest over time on a scale from 0 to 100, interest by subregion and location, related topics, and related search queries. Results can be filtered by a custom time range, which can be matched to the year in which the top streamers data was scraped and collected. When using Trends data, one must attribute the information to Google with a citation. For example, add "Data source: Google Trends (https://www.google.com/trends)." The data can be compared with the Twitch stream data to understand whether search interest is related to stream metrics. No ethical issues are foreseen in the context of this project.

## Unknowns and dependencies

With these datasets, I acknowledge that the results of this project will not fully explain patterns, as there are uncontrolled variables such as type of stream, content, connectivity, or scheduling (for both viewers and streamers). In this regard, the results will likely be limited. While a minimum viable product detailing discoveries can be shared by the end of the quarter, it may not be fully fleshed out due to the time constraint of four weeks. Additionally, these datasets are available in CSV format only, which means the data may not be as complex but is easy to read and process. Finally, the Twitch dataset comes from Kaggle, which does not guarantee the accuracy of its data; as a result, its reliability is uncertain and the results should be treated with some caution. With this in mind, it is important to read the findings as potential patterns or correlations that may not be entirely accurate.


## Research questions

#### How might search results impact watch times and viewer count of a streamer?

This question aims to investigate whether search results impact a streamer's channel. It begins by analyzing search results and identifying whether there is a pattern between the results and viewership.

#### What is the relationship between the popularity of a streamer and frequency of searches for their alias?

While similar to the previous question, this question instead starts from streamer popularity and attempts to map that popularity to search queries.

#### How might search results impact follower counts of a streamer?

Again similar to the first question, this question aims to understand how search results impact a streamer's channel, specifically whether users choose to follow that streamer.

## Background and related work

Using the Twitch API, [Tomaz Bratanic has investigated streamers and their content](https://towardsdatascience.com/twitchverse-a-network-analysis-of-twitch-universe-using-neo4j-graph-data-science-d7218b4453ff). While not directly related to my analysis, there is helpful information around Twitch and streamers, who are a focal point of my analysis. This previous study helps set further foundation and context as I move forward.

Using the same streamer dataset I will be examining, [Jagannath Pal has investigated relationships within the streamer data, such as between streaming time and followers gained](https://www.kaggle.com/code/jagannathpal/twitch-streamer-analysis-eda-prediction#Exploratory-Data-Analysis), which are values I too am exploring. [Lander Iturregi](https://www.kaggle.com/code/landeriturregi/twitch-streamers-2020-data-analysis) has also performed some analysis of Twitch to identify top streamers and general population data about streamers. These explorations will allow me to narrow my scope to select streamers, demographics, or types of content that I wish to pursue.

While there is no public analysis directly related to my chosen investigation, these previous explorations help me decide which streamers to pursue. For example, Lander's analysis shows that Tfue, shroud, and Myth were the top three streamers with the most followers in 2020 and may be worth analyzing. Tomaz's analysis similarly shows Tfue and shroud, but instead of Myth, his data indicates that rubius was the third most followed streamer in 2021. Tomaz's exploration also breaks down various Twitch categories and shows the five most popular at the time of research to be Just Chatting, Resident Evil Village, Grand Theft Auto V, League of Legends, and Fortnite. Understanding various streamers, their specialties, and the games they choose to broadcast is important in understanding whether search queries and viewership have a significant correlation.

Previous research primarily informs my exploration by helping me select data to analyze, as there are clearly too many streamers to examine the Google search trends for all of them. As a result, I will narrow my scope to some of the larger, better-known streamers whose Twitch channel name, or alias, is recognizable enough that searches for it are likely to return relevant results (for instance, "Myth" is an ordinary word, but the streamer is popular enough that streamer results also appear).

## Methodology

To begin investigating this data, I will identify and select the streamers and their aliases, or channel names, to work with. I will then use the Google Trends data to view patterns in how often each alias is searched. I will also examine streamer data such as follower count and viewership for each alias.
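One way these steps might fit together is sketched below: each alias's 0 to 100 Trends series is collapsed into a mean and a peak for the year, and that summary is attached to the matching streamer record. The function names, the dictionary keys (`channel`, `followers`), and the sample values are all hypothetical placeholders, and summarizing by mean and peak is an assumption rather than a settled choice.

```python
from statistics import mean


def summarize_interest(interest_series):
    """Collapse a 0-100 interest series into a mean and a peak value."""
    return {
        "mean_interest": mean(interest_series),
        "peak_interest": max(interest_series),
    }


def join_by_alias(streamers, trends):
    """Attach a Trends summary to each streamer record that has a series.

    Streamers without a (non-empty) Trends series are dropped from the result.
    """
    joined = []
    for record in streamers:
        series = trends.get(record["channel"])
        if series:
            joined.append({**record, **summarize_interest(series)})
    return joined


# Placeholder values for illustration only; not real figures.
streamers = [
    {"channel": "channel_a", "followers": 1000},
    {"channel": "channel_b", "followers": 500},
]
trends = {"channel_a": [10, 20, 30, 100]}

combined = join_by_alias(streamers, trends)
```

Dropping aliases that have no Trends series keeps the later comparison limited to streamers for whom both sources are available.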

I plan to begin with an ordinary least squares regression to understand if there is a relationship between search queries and streamer popularity, where popularity may be defined as their follower count or viewership. I intend to examine both. This method allows me to identify if there is or is not a potential relationship between my variables before I dive deeper into what that relationship might be.
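A minimal sketch of a single-predictor ordinary least squares fit, written in plain Python so the arithmetic is visible, is shown here. The function name `ols_fit` is hypothetical, `x` stands in for a search-interest measure, and `y` stands in for a popularity measure such as follower count; in practice a statistics library would likely be used instead.

```python
def ols_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_squared). r_squared is NaN when y is constant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, with at least two points")
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x has no variation, so the slope is undefined")
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # R^2 compares residual variation with total variation in y.
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return slope, intercept, r_squared
```

A slope near zero or a low R² would suggest little linear relationship between the two variables, while a strong fit would still only indicate correlation, not that searches cause popularity.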

As I analyze my data, I will refer back to my research questions to understand if there are any clear answers and if there is further investigation needed. Depending on those findings, I may choose to gather more samples (more streamers), narrow my scope, or broaden my scope.

Moving onwards, I will organize my findings in a table for visibility, but push towards data visualizations in the form of graphs or charts to showcase relationships, or the lack of a relationship, in my data. This will allow viewers to easily scan and comprehend my research findings. A simple table is helpful for viewing multiple datasets together, for instance multiple streamers with their search query trends, follower counts, and viewership. I could also showcase all of my streamer data in one graph, but this would be cluttered, noisy, and overall difficult to comprehend. As a result, I aim to give each streamer I analyze a separate graph that can showcase a potential relationship or the lack of one.
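For the summary table, one possible approach is to render the combined per-streamer records as a Markdown table, as in the sketch below. The function name `to_markdown_table` and the column names are hypothetical, and the row is a placeholder rather than a real result.

```python
def to_markdown_table(rows, columns):
    """Render a list of dicts as a Markdown table with the given column order.

    Missing values are rendered as empty cells.
    """
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row.get(col, "")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


# Placeholder row for illustration only; not a real result.
example_rows = [{"channel": "channel_a", "followers": 1000}]
table = to_markdown_table(example_rows, ["channel", "followers", "mean_interest"])
```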