# Final project guidelines

**Note:** Use these guidelines if and only if you are pursuing a **final project of your own design**. For those taking the final exam instead of the project, see the (separate) final exam notebook.

## Guidelines

These guidelines are intended for **undergraduates enrolled in INFO 3350**. If you are a graduate student enrolled in INFO 6350, you're welcome to consult the information below, but you have wider latitude to design and develop your project in line with your research goals.

### The task

Your task is to: identify an interesting problem connected to the humanities or humanistic social sciences that's addressable with the help of computational methods, formulate a hypothesis about it, devise an experiment or experiments to test your hypothesis, present the results of your investigations, and discuss your findings.

These tasks essentially replicate the process of writing an academic paper. You can think of your project as a paper in miniature.

You are free to present each of these tasks as you see fit. You should use narrative text (that is, your own writing in a markdown cell), citations of others' work, numerical results, tables of data, and static and/or interactive visualizations as appropriate. Total length is flexible and depends on the number of people involved in the work, as well as the specific balance you strike between the ambition of your question and the sophistication of your methods. But be aware that numbers never, ever speak for themselves. Quantitative results presented without substantial discussion will not earn high marks. 

Your project should reflect **ten or more** hours of work by each participant, though you will be graded on the quality of your work, not the amount of time it took you to produce it. Most high-quality projects represent twenty or more hours of work by each member.

#### Pick an important and interesting problem!

No amount of technical sophistication will overcome a fundamentally uninteresting problem at the core of your work. You have seen many pieces of successful computational humanities research over the course of the semester. You might use these as a guide to the kinds of problems that interest scholars in a range of humanities disciplines. You may also want to spend some time in the library, reading recent books and articles in the professional literature. **Problem selection and motivation are integral parts of the project.** Do not neglect them.

### Format

You should submit your project as a Jupyter notebook, along with all data necessary to reproduce your analysis. If your dataset is too large to share easily, let us know in advance so that we can find a workaround. If you have a reason to prefer a presentation format other than a notebook, likewise let us know so that we can discuss the options.

Your report should have four basic sections (provided in cells below for ease of reference):

1. **Introduction and hypothesis.** What problem are you working on? Why is it interesting and important? What have other people said about it? What do you expect to find?
2. **Corpus, data, and methods.** What data have you used? Where did it come from? How did you collect it? What are its limitations or omissions? What major methods will you use to analyze it? Why are those methods the appropriate ones?
3. **Results.** What did you find? How did you find it? How should we read your figures? Be sure to include confidence intervals or other measures of statistical significance or uncertainty where appropriate.
4. **Discussion and conclusions.** What does it all mean? Do your results support your hypothesis? Why or why not? What are the limitations of your study and how might those limitations be addressed in future work?

Within each of those sections, you may use as many code and markdown cells as you like. You may, of course, address additional questions or issues not listed above.

All code used in the project should be present in the notebook (except for widely-available libraries that you import), but **be sure that we can read and understand your report in full without rerunning the code**. Be sure, too, to explain what you're doing along the way, both by describing your data and methods and by writing clean, well commented code.

### Grading

This project takes the place of the take-home final exam for the course. It is worth 35% of your overall grade. You will be graded on the quality and ambition of each aspect of the project. No single component is more important than the others.

### Practical details

* The project is due at **noon on Saturday, December 9** via upload to CMS of a single zip file containing your fully executed Jupyter notebook and all associated data.
* You may work alone or in a group of up to three total members.
    * If you work in a group, be sure to list the names of the group members.
    * For groups, create your group on CMS and submit one notebook for the entire group. **Each group should also submit a statement of responsibility** that describes in general terms who performed which parts of the project.
* You may post questions on Ed, but should do so privately (visible to course staff only).
* Interactive visualizations do not always work when embedded in shared notebooks. If you plan to use interactives, you may need to host them elsewhere and link to them.

---

## Your info
* NetID(s): sc2548, jsc342
* Name(s): Stephy Chen, Joyce Chen
---

## Brainstorm Ideas for Project ##

1. Examining lyrics of songs from varying genres, artists, etc. (possibly taken from Billboard, top 100 charts, etc.) to see whether they reflect popular trends of their period, based on the current events occurring at the time (may include politically or socially motivated events)

2. Examining how authors (perhaps classical authors from a certain time period/genre) used adjectives and other connotative words/phrases to describe gender. (Could we predict information about the author from these phrases?)

3. Examining song lyrics from varying genres/artists to determine how they describe gender. Seeing whether the language used or connotation of their words reflect a trend in the type of artists that sang or wrote the lyrics.

4. Examining the diction of media and news outlets in describing current events relating to political partisanship (whether their words indicate the party they lean towards)
   - Looking at whether the words used to describe certain events are comparatively positive/negative in connotation
   - Seeing whether there is a distinction between the types of words used to describe current events by left-wing and right-wing outlets
   - Implications:
     - Being aware of echo chambers
     - Recommending news outlets/media that are more neutral
     - Indicating which news outlets/media are more biased towards a side
     - Is there a correlation with the recent political polarization?
https://www.mediacloud.org/media-cloud-directory 

5. Examine reddit and discussion forums to understand incel culture/crimes against women descriptions to see 


FINAL DECISION: CHOICE 4 

---


## 1. Introduction

In an era marked by unprecedented political polarization, the focus on media and its role in shaping public perception has never been more critical. This data project delves into the intricate web of language, word choice, and stylistic choices employed by various news outlets when reporting on current events, with a keen emphasis on political partisanship.

The importance of this investigation lies in its potential to uncover implicit biases within media narratives. First, we examine the diction and connotations that news outlets use to describe events, looking for patterns that indicate whether an outlet's word choice or coverage leans towards a particular political party. In parallel, we compare the language used by outlets across the political spectrum when covering the same events, to identify the distinct narratives crafted by left-wing and right-wing sources. Second, we ask whether certain events are portrayed with a comparatively positive or negative slant, contributing to the broader discourse on media objectivity.

The significance of this research is far-reaching and transcends academic curiosity. It holds profound implications for societal awareness and media literacy. By making citizens aware of potential echo chambers in media consumption, the project aims to empower individuals to critically evaluate the information they receive. Beyond mere awareness, our project aspires to offer recommendations for news outlets and media that demonstrate a more neutral stance, allowing consumers to make informed choices about their news sources.

Existing literature has already shed light on the political slant present in many media sources (the AllSides chart below is widely used to place media outlets on the political spectrum). This project builds upon this foundation, seeking not only to confirm these biases but also to provide a nuanced understanding of the language that perpetuates them. As political polarization continues to shape the socio-political landscape, this research contributes to the ongoing dialogue by exploring the correlation between media language and the evolving dynamics of political partisanship.

![image](pictures/all-sides.jpeg)

https://www.allsides.com/media-bias/media-bias-chart 

https://www.pewresearch.org/journalism/2014/10/21/section-1-media-sources-distinct-favorites-emerge-on-the-left-and-right/


## 2. Research Question and Hypothesis

### Hypothesis

We hypothesize that there is a correlation between the language and style employed by news outlets in reporting current events and political partisanship. 

(if specificity is needed, here is a version: We hypothesize that there is a positive correlation between the language and style employed by news outlets in reporting current events and political partisanship. Specifically, we anticipate that news outlets leaning towards a particular political party will use language and style that align with the ideologies of that party.)

It would be better for us to focus on a specific topic: a single issue (a couple hundred articles) covered by two outlets.

### Research Question 

Do news outlets' language and style in reporting current events correlate with political partisanship? How much do these language choices link to the perceived positivity or negativity of specific events?

### Selected News Outlets 

Out of all the outlets in the AllSides chart, we will choose 3 from each category and analyze the articles they published. (Should we have a separate section where we specifically pick articles from different sources that are covering the same events, or is that included within the 3 that we choose from each category?)

#### Left 
- CNN 
- Buzzfeed News 
- HuffPost 
- MSNBC 
- The New Yorker 

#### Left-Leaning 
- Bloomberg 
- CBS 
- NBC 
- New York Times News
- Washington Post 
- USA Today 

#### Center
- Wall Street Journal News 
- Reuters 
- Newsweek 
- BBC 
- The Hill 

#### Right-Leaning 
- Epoch Times 
- The Washington Times 
- The Post Millennial 
- The American Conservative 
- The Dispatch 

#### Right
- Daily Mail 
- Daily Wire 
- Fox News 
- The Federalist 
- The American Spectator



OFFICIAL LIST

#### Left 
- CNN 

#### Left-Leaning 
- New York Times News

#### Center
- Wall Street Journal News (Optional)
- BBC 

#### Right-Leaning 
- The New York Post News

#### Right 
- Fox News 




## 2. Corpus & Data Cleaning

### Corpus Creation: DEADLINE (12/05)

In [10]:
import requests
from bs4 import BeautifulSoup


1. Scrape a ridiculous amount of articles from news outlets

2. Read relevant papers (on abortion and the work that has been done to analyze it thus far)
http://languagelog.ldc.upenn.edu/myl/Monroe.pdf
https://www.pewresearch.org/religion/fact-sheet/public-opinion-on-abortion/#CHAPTER-h-views-on-abortion-2021-a-detailed-look 

3. Standard of Comparison Creation Part I: Examine the partisanship of congressional speeches to create a spectrum of words that indicates whether a given phrasing belongs to a certain party
https://data.stanford.edu/congress_text (scroll down on page to find multiple zip files)

4. Standard of Comparison Creation Part II: Examine presidential debates regarding abortion and use them as a standard to characterize political stances on abortion (whether speakers express positive or negative views)
https://www.debates.org/voter-education/debate-transcripts/   

- Measure the similarity between the phrasings in the standards created (backed up by scholarly articles) and the phrasings commonly found in news articles
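
One possible way to quantify the similarity step above is sketched here. This is only an illustration under stated assumptions: it treats word bigrams as a rough stand-in for "phrasings" and uses cosine similarity over bigram counts, neither of which is fixed by the plan above. The function names and the sample strings are hypothetical placeholders, not real speeches or articles.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bigram_counts(text):
    """Count adjacent word pairs, a rough stand-in for 'phrasings'."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[k] * counts_b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up placeholder strings (assumptions for illustration only)
reference_text = "we must protect the right to choose and protect access to care"
article_text = "the bill would protect the right to choose for patients"
unrelated_text = "the stock market closed higher on strong earnings"

reference = bigram_counts(reference_text)
sim_related = cosine_similarity(reference, bigram_counts(article_text))
sim_unrelated = cosine_similarity(reference, bigram_counts(unrelated_text))
print(sim_related, sim_unrelated)
```

In practice the reference counts would come from the congressional or debate text and the comparison counts from each outlet's articles; raw bigram counts are sensitive to document length and common words, so a weighting scheme such as TF-IDF may be a better fit.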

  

## Data Scraping Presidential Debates

Let's scrape the raw text from presidential and vice presidential debate transcripts, keeping the paragraphs that mention abortion.


In [4]:
# Install requests and BeautifulSoup (both are imported above)

!pip install requests beautifulsoup4



In [13]:
# Scraped data for the 2020 Vice Presidential Debate (Kamala Harris and Mike Pence)

URL = "https://debates.org/voter-education/debate-transcripts/vice-presidential-debate-at-the-university-of-utah-in-salt-lake-city-utah/"
page = requests.get(URL)
page.raise_for_status()  # stop here if the page could not be fetched

soup = BeautifulSoup(page.content, "html.parser")
# Keep only the <p> elements whose text mentions abortion (case-insensitive)
paragraphs_with_abortion = [p.get_text() for p in soup.find_all('p') if 'abortion' in p.get_text().lower()]

# Join every matching paragraph into one string so that none are lost
pres_debate_2020 = "\n".join(paragraphs_with_abortion)
print(pres_debate_2020)

print(type(pres_debate_2020))  # <class 'str'>


PAGE: Vice President Pence. Vice President Pence. I didn’t – Vice President Pence – I did not create the rules for tonight. Your campaigns agreed to the rules for tonight’s debate, with the Commission on Presidential Debates. I’m here to enforce them, which involves moving from one topic to another, giving roughly equal time to both of you, which is what I’m trying very hard to do. So I want to go ahead and move to the next topic, which is an important one, as the last topic was, and that is the Supreme Court. On Monday, the Senate Judiciary Committee is scheduled to open hearings on Amy Coney Barrett’s nomination to the Supreme Court. Senator Harris, you’ll be there as a member of the committee. Her confirmation would cement the court’s conservative majority, and make it likely open to more abortion restrictions, even to overturning the landmark Roe v Wade ruling. Access to abortion would then be up to the states. Vice President Pence, you’re the former governor of Indiana. If Roe v W

In [19]:
# Scraped data for the 2016 Vice Presidential Debate (Tim Kaine and Mike Pence)

# October 4, 2016

URL = "https://www.debates.org/voter-education/debate-transcripts/october-4-2016-debate-transcript/"
page = requests.get(URL)
page.raise_for_status()  # stop here if the page could not be fetched

soup = BeautifulSoup(page.content, "html.parser")
# Keep only the <p> elements whose text mentions abortion (case-insensitive)
paragraphs_with_abortion = [p.get_text() for p in soup.find_all('p') if 'abortion' in p.get_text().lower()]

# Join every matching paragraph into one string so that none are lost
pres_debate_2016_1 = "\n".join(paragraphs_with_abortion)
print(pres_debate_2016_1)

print(type(pres_debate_2016_1))  # <class 'str'>


The state of Indiana has also sought to make sure that we expand alternatives in health care counseling for women, non-abortion alternatives. I’m also very pleased at the fact we’re well on our way in Indiana to becoming the most pro-adoption state in America. I think if you’re going to be pro-life, you should — you should be pro-adoption.
But what I can’t understand is with Hillary Clinton and now Senator Kaine at her side is to support a practice like partial-birth abortion. I mean, to hold to the view — and I know Senator Kaine, you hold pro-life views personally — but the very idea that a child that is almost born into the world could still have their life taken from them is just anathema to me.
And I cannot — I can’t conscience about — about a party that supports that. Or that — I know you’ve historically opposed taxpayer funding of abortion. But Hillary Clinton wants to — wants to repeal the longstanding provision in the law where we said we wouldn’t use taxpayer dollars to fund 

In [18]:
# Scraped data for the third 2016 Presidential Debate (Hillary Clinton and Donald Trump)

# October 19, 2016

URL = "https://www.debates.org/voter-education/debate-transcripts/october-19-2016-debate-transcript/"
page = requests.get(URL)
page.raise_for_status()  # stop here if the page could not be fetched

soup = BeautifulSoup(page.content, "html.parser")
# Keep only the <p> elements whose text mentions abortion (case-insensitive)
paragraphs_with_abortion = [p.get_text() for p in soup.find_all('p') if 'abortion' in p.get_text().lower()]

# Join every matching paragraph into one string so that none are lost
pres_debate_2016_2 = "\n".join(paragraphs_with_abortion)
print(pres_debate_2016_2)

print(type(pres_debate_2016_2))  # <class 'str'>


WALLACE: Well, let’s pick up on another issue which divides you and the justices that whoever ends up winning this election appoints could have a dramatic effect there, and that’s the issue of abortion.
WALLACE: Mr. Trump, you’re pro-life. But I want to ask you specifically: Do you want the court, including the justices that you will name, to overturn Roe v. Wade, which includes—in fact, states—a woman’s right to abortion?
CLINTON: And we have come too far to have that turned back now. And, indeed, he said women should be punished, that there should be some form of punishment for women who obtain abortions. And I could just not be more opposed to that kind of thinking.
WALLACE: I’m going to give you a chance to respond, but I want to ask you, Secretary Clinton, I want to explore how far you believe the right to abortion goes. You have been quoted as saying that the fetus has no constitutional rights. You also voted against a ban on late-term, partial-birth abortions. Why?
CLINTON: Beca

## 3. Methods

### DEADLINE: TUE 12/5 (Midnight)

- Use Fightingwords to examine partisan difference in news outlets (FIGHTING WORDS LECTURE 10, use given library, used in PSET 3) - look at words 
- Use VADER on the presidential debates + congressional speech phrasings (VADER WAS FOUND IN PSET 1)
- Use (k-means) Clustering of phrasings (apply to all) (FOUND IN PSET 2, LECTURE 6, 7)
- Use BERT Classification (PSET 5, Lecture 15, 16, article discussed)
- For CENTER outlets, use VADER to examine the connotation of the words each outlet uses, then compare its vocabulary against the fightingwords identified for left and right and examine how those words are distributed within that outlet
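
To make the Fightin' Words step in the list above concrete, here is a minimal sketch of the underlying statistic: a z-scored log-odds ratio with a Dirichlet prior, following the Monroe et al. paper linked earlier. It is not the course-provided library. The function name `fighting_words_z_scores`, the flat prior value of 0.01, and the toy token lists are all assumptions made for illustration; the paper's informative prior would instead be estimated from a background corpus.

```python
import math
from collections import Counter


def fighting_words_z_scores(tokens_a, tokens_b, prior=0.01):
    """Z-scored log-odds ratio per word; positive values lean toward corpus A."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    if len(vocab) < 2:
        raise ValueError("need at least two distinct words across both corpora")
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    prior_total = prior * len(vocab)  # flat prior: same pseudo-count for every word

    z_scores = {}
    for word in vocab:
        y_a = counts_a[word] + prior  # smoothed count in corpus A
        y_b = counts_b[word] + prior  # smoothed count in corpus B
        log_odds_a = math.log(y_a / (n_a + prior_total - y_a))
        log_odds_b = math.log(y_b / (n_b + prior_total - y_b))
        variance = 1.0 / y_a + 1.0 / y_b  # approximate variance of the difference
        z_scores[word] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return z_scores


# Toy token lists (made up for illustration, not drawn from any outlet)
left_tokens = "reproductive rights access care rights choice".split()
right_tokens = "unborn life protect life unborn child".split()

z = fighting_words_z_scores(left_tokens, right_tokens)
for word in sorted(z, key=z.get, reverse=True):
    print(f"{word:>12s} {z[word]: .3f}")
```

Words at the top of the printed list are more characteristic of the first corpus and words at the bottom of the second. With corpora this small the z-scores are far from any significance threshold; the toy data only shows the direction of the scores.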

## 4. Results

### DEADLINE: THURS 12/7 (Midnight)

## 5. Discussion and conclusions

### DEADLINE: 12/8