# Final project guidelines

**Note:** Use these guidelines if and only if you are pursuing a **final project of your own design**. For those taking the final exam instead of the project, see the (separate) final exam notebook.

## Guidelines

These guidelines are intended for **undergraduates enrolled in INFO 3350**. If you are a graduate student enrolled in INFO 6350, you're welcome to consult the information below, but you have wider latitude to design and develop your project in line with your research goals.

### The task

Your task is to: identify an interesting problem connected to the humanities or humanistic social sciences that's addressable with the help of computational methods, formulate a hypothesis about it, devise an experiment or experiments to test your hypothesis, present the results of your investigations, and discuss your findings.

These tasks essentially replicate the process of writing an academic paper. You can think of your project as a paper in miniature.

You are free to present each of these tasks as you see fit. You should use narrative text (that is, your own writing in a markdown cell), citations of others' work, numerical results, tables of data, and static and/or interactive visualizations as appropriate. Total length is flexible and depends on the number of people involved in the work, as well as the specific balance you strike between the ambition of your question and the sophistication of your methods. But be aware that numbers never, ever speak for themselves. Quantitative results presented without substantial discussion will not earn high marks. 

Your project should reflect, at minimum, ten **or more** hours of work by each participant, though you will be graded on the quality of your work, not the amount of time it took you to produce it. Most high-quality projects represent twenty or more hours of work by each member.

#### Pick an important and interesting problem!

No amount of technical sophistication will overcome a fundamentally uninteresting problem at the core of your work. You have seen many pieces of successful computational humanities research over the course of the semester. You might use these as a guide to the kinds of problems that interest scholars in a range of humanities disciplines. You may also want to spend some time in the library, reading recent books and articles in the professional literature. **Problem selection and motivation are integral parts of the project.** Do not neglect them.

### Format

You should submit your project as a Jupyter notebook, along with all data necessary to reproduce your analysis. If your dataset is too large to share easily, let us know in advance so that we can find a workaround. If you have a reason to prefer a presentation format other than a notebook, likewise let us know so that we can discuss the options.

Your report should have four basic sections (provided in cells below for ease of reference):

1. **Introduction and hypothesis.** What problem are you working on? Why is it interesting and important? What have other people said about it? What do you expect to find?
2. **Corpus, data, and methods.** What data have you used? Where did it come from? How did you collect it? What are its limitations or omissions? What major methods will you use to analyze it? Why are those methods the appropriate ones?
3. **Results.** What did you find? How did you find it? How should we read your figures? Be sure to include confidence intervals or other measures of statistical significance or uncertainty where appropriate.
4. **Discussion and conclusions.** What does it all mean? Do your results support your hypothesis? Why or why not? What are the limitations of your study and how might those limitations be addressed in future work?

Within each of those sections, you may use as many code and markdown cells as you like. You may, of course, address additional questions or issues not listed above.

All code used in the project should be present in the notebook (except for widely-available libraries that you import), but **be sure that we can read and understand your report in full without rerunning the code**. Be sure, too, to explain what you're doing along the way, both by describing your data and methods and by writing clean, well commented code.

### Grading

This project takes the place of the take-home final exam for the course. It is worth 35% of your overall grade. You will be graded on the quality and ambition of each aspect of the project. No single component is more important than the others.

### Practical details

* The project is due at **noon on Saturday, December 9** via upload to CMS of a single zip file containing your fully executed Jupyter notebook and all associated data.
* You may work alone or in a group of up to three total members.
    * If you work in a group, be sure to list the names of the group members.
    * For groups, create your group on CMS and submit one notebook for the entire group. **Each group should also submit a statement of responsibility** that describes in general terms who performed which parts of the project.
* You may post questions on Ed, but should do so privately (visible to course staff only).
* Interactive visualizations do not always work when embedded in shared notebooks. If you plan to use interactives, you may need to host them elsewhere and link to them.

---

## Your info
* NetID(s): sc2548, jsc342
* Name(s): Stephy Chen, Joyce Chen
---

## Brainstorm Ideas for Project ##

1. Examining lyrics of songs from varying genres, artists, etc. (possibly taken from Billboard, top 100 charts, etc.) to see whether they reflect popular trends of the period, based on the current events occurring at the time (may include politically or socially motivated events)

2. Examining how authors (perhaps classical authors from a certain range of time/genre) used adjectives/certain connotative words/phrases to describe gender? (Could we predict information about the author through these used phrases)

3. Examining song lyrics from varying genres/artists to determine how they describe gender. Seeing whether the language used or connotation of their words reflect a trend in the type of artists that sang or wrote the lyrics.

4. Examining media and news outlets' diction in describing current events relating to political partisanship (whether their words indicate the party they lean towards)
   - Looking at whether the words used to describe certain events are comparatively positive or negative in connotation
   - Seeing whether there is a distinction between the types of words used to describe current events by left- and right-wing outlets
   - Implications:
     - Being aware of echo chambers
     - Recommending news outlets/media that are more neutral
     - Indicating which news outlets/media are more biased towards one side
     - Is there a correlation with recent political polarization?
https://www.mediacloud.org/media-cloud-directory 

5. Examine reddit and discussion forums to understand incel culture/crimes against women descriptions to see 


FINAL DECISION: CHOICE 4 

---


## 1. Introduction

In an era marked by unprecedented political polarization, the focus on media and its role in shaping public perception has never been more critical. This data project delves into the intricate web of language, word choice, and stylistic choices employed by various news outlets when reporting on current events, with a keen emphasis on political partisanship.

The importance of this investigation lies in its potential to uncover implicit biases within media narratives and disseminated information. Firstly, we scrutinize the diction and connotations used to describe events among news outlets, seeking to discern patterns that determine whether their choice of words or coverage of events indicates a leaning towards a particular political party. Simultaneously, we conduct a comparative analysis of language used by different media outlets (across the political spectrum) when covering the same events, unraveling distinct narratives crafted by left-wing and right-wing sources to evaluate the implications of these differing perspectives. Secondly, the exploration extends beyond mere observation, delving into whether certain events are portrayed with a comparably positive or negative bias, contributing to the broader discourse on media objectivity.

The significance of this research is far-reaching and transcends academic curiosity. It holds profound implications for societal awareness and media literacy. By making citizens aware of potential echo chambers in media consumption, the project aims to empower individuals to critically evaluate the information they receive. Beyond mere awareness, our project aspires to offer recommendations for news outlets and media that demonstrate a more neutral stance, allowing consumers to make informed choices about their news sources.

Existing research has already documented the political slant of many media sources; the AllSides media bias chart shown below is a widely used reference for classifying outlets. This project builds upon this foundation, seeking not only to confirm these biases but also to provide a nuanced understanding of the language that perpetuates them. As political polarization continues to shape the socio-political landscape, this research contributes to the ongoing dialogue by exploring the correlation between media language and the evolving dynamics of political partisanship.

![image](pictures/all-sides.jpeg)

https://www.allsides.com/media-bias/media-bias-chart 

https://www.pewresearch.org/journalism/2014/10/21/section-1-media-sources-distinct-favorites-emerge-on-the-left-and-right/


## 2. Research Question and Hypothesis

### Hypothesis

We hypothesize that there is a correlation between the language and style employed by news outlets in reporting current events and political partisanship. 

(if specificity is needed, here is a version: We hypothesize that there is a positive correlation between the language and style employed by news outlets in reporting current events and political partisanship. Specifically, we anticipate that news outlets leaning towards a particular political party will use language and style that align with the ideologies of that party.)

It would be better for us to focus on a specific topic: a single issue (a couple hundred articles) covered by two outlets.

### Research Question 

Do news outlets' language and style in reporting current events correlate with political partisanship? How much do these language choices link to the perceived positivity or negativity of specific events?

### Selected News Outlets 

Out of all the outlets shown in the AllSides chart, we will choose 3 from each category and analyze the articles they published. (Should we have a separate section where we specifically pick articles from different sources that are covering the same events, or is that included within the 3 that we choose from each category?)

#### Left 
- CNN 
- Buzzfeed News 
- HuffPost 
- MSNBC 
- The New Yorker 

#### Left-Leaning 
- Bloomberg 
- CBS 
- NBC 
- New York Times News
- Washington Post 
- USA Today 

#### Center
- Wall Street Journal News 
- Reuters 
- Newsweek 
- BBC 
- The Hill 

#### Right-Leaning 
- Epoch Times 
- The Washington Times 
- The Post Millennial 
- The American Conservative 
- The Dispatch 

#### Right
- Daily Mail 
- Daily Wire 
- Fox News 
- The Federalist 
- The American Spectator



OFFICIAL LIST

#### Left 
- CNN 

#### Left-Leaning 
- New York Times News

#### Center
- Wall Street Journal News (Optional)
- BBC 

#### Right-Leaning 
- The New York Post News

#### Right 
- Fox News 




## 2. Corpus & Data Cleaning

### Corpus Creation: DEADLINE (12/05)

In [6]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd



1. Scrape a ridiculous amount of articles from news outlets

2. Read relevant papers (on abortion and the work that has been done to analyze thus far) 
http://languagelog.ldc.upenn.edu/myl/Monroe.pdf
https://www.pewresearch.org/religion/fact-sheet/public-opinion-on-abortion/#CHAPTER-h-views-on-abortion-2021-a-detailed-look 

3. Standard of Comparison Creation Part I: Examine the partisanship of congressional speeches to create a spectrum of words that indicates whether a given phrasing belongs to a particular party
https://data.stanford.edu/congress_text (scroll down on page to find multiple zip files)

4. Standard of Comparison Creation Part II: Examine presidential debates regarding abortion and use that as a standard to characterize the political stance on abortion (whether they use or have positive/negative views) 
https://www.debates.org/voter-education/debate-transcripts/   

- Finding similarity of the phrasings used between the standards created (backed up by scholarly articles) with the phrasings commonly found within news articles
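
One way to make the comparison in the last step concrete is a bag-of-words cosine similarity between a reference text for each party and a news article. The sketch below is a minimal standard-library illustration under that assumption. The helper names (`tokenize`, `cosine_similarity`) and the sample strings are hypothetical and are not drawn from the corpus.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical reference phrasings and article snippet (illustration only)
reference_left = "reproductive rights and access to health care"
reference_right = "protect the unborn and the sanctity of life"
article = "the ruling limits access to reproductive health care"

sim_left = cosine_similarity(article, reference_left)
sim_right = cosine_similarity(article, reference_right)
print(f"left: {sim_left:.3f}  right: {sim_right:.3f}")
```

Raw counts let common function words such as "the" contribute to the score, so stop-word removal or TF-IDF weighting would likely be needed before using this on full articles.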

  

## Data Scraping News Outlet Articles

### CNN Articles (51)

In [17]:
import pandas as pd

def read_txt_to_dataframe(file_path):
    """Parse a text file of 'Key: value' lines into a DataFrame, one row per article."""
    fields = ['News_Outlet', 'Title', 'Author', 'Publication_Date', 'Article_Content']
    articles_data = []     # one dictionary per completed article
    current_article = {}   # fields collected so far for the article in progress

    # Assumes the file is UTF-8 encoded (the articles contain curly quotes)
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            # Split on the first colon only, so values may themselves contain colons
            line_parts = line.split(':', 1)

            # Skip lines that cannot be split into a key and a value
            if len(line_parts) != 2:
                continue

            key, value = map(str.strip, line_parts)
            current_article[key] = value

            # 'Article_Content' is the last field of a record, so it closes the article
            if key == 'Article_Content':
                articles_data.append({f: current_article.get(f, '') for f in fields})
                current_article = {}

    return pd.DataFrame(articles_data, columns=fields)

txt_file_path = 'CNN.txt'

cnn_data_frame = read_txt_to_dataframe(txt_file_path)

cnn_data_frame.head(5)



Unnamed: 0,News_Outlet,Title,Author,Publication_Date,Article_Content
0,CNN,"Abortion is ancient history: Long before Roe, ...",Katie Hunt,"Published 7:29 AM EDT, Fri June 23, 2023","CNN — Abortion today, at least in the United ..."
1,CNN,Another state passed a near-total abortion ban...,Zachary B. Wolf,"Updated 2:13 AM EDT, Wed September 14, 2022",CNN — Good luck trying to keep on top of the ...
2,CNN,Myths about abortion and women’s mental health...,Sandee LaMotte,"Updated 11:12 PM EDT, Mon July 11, 2022",CNN — It’s an unfounded message experts say i...
3,CNN,Survey finds widespread confusion around medic...,Deidre McPhillips,"Published 6:13 AM EST, Wed February 1, 2023",CNN — Nearly half of adults in the United Sta...
4,CNN,Births have increased in states with abortion ...,Deidre McPhillips,"Published 1:22 PM EST, Tue November 21, 2023",CNN — Nearly a quarter of people seeking an ...
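
The parser above assumes each article is stored as a run of `Key: value` lines that ends with an `Article_Content` line. The sketch below applies the same splitting rule to an in-memory sample so the expected layout is visible. The sample record is hypothetical, and `parse_records` is an illustrative name rather than a function from this notebook.

```python
def parse_records(lines):
    """Group 'Key: value' lines into dicts; 'Article_Content' closes a record."""
    records, current = [], {}
    for line in lines:
        parts = line.split(':', 1)  # split on the first colon only
        if len(parts) != 2:
            continue  # skip blank lines and lines without a key
        key, value = parts[0].strip(), parts[1].strip()
        current[key] = value
        if key == 'Article_Content':
            records.append(current)
            current = {}
    return records

# Hypothetical sample in the assumed file layout
sample = """News_Outlet: Example Outlet
Title: A headline: with a colon
Author: Jane Doe
Publication_Date: Published 9:00 AM EST, Mon January 1, 2024
Article_Content: Body text of the article on one line.
"""

records = parse_records(sample.splitlines())
print(records[0]['Title'])
print(records[0]['Publication_Date'])
```

Because any line containing a colon is read as a new field, an article body that wraps across several lines would be misparsed under this rule; such files would need a different record delimiter.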


In [22]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import fightinwords as fw

# Extract 'Article_Content' column 
article_contents = cnn_data_frame['Article_Content'].values

# Vectorize using CountVectorizer
vec = CountVectorizer(
    lowercase=True,
    strip_accents='unicode',
    input='content',
    encoding='utf-8',
    stop_words='english'
)
X = vec.fit_transform(article_contents)

# Calculate Fighting Words values
fighting_words_results = fw.bayes_compare_language(
    l1=np.arange(len(article_contents)),  # indices for all CNN articles
    l2=[],  # no comparison group yet, so the scores below mostly track overall term frequency
    features=X,
    ngram=1,
    prior=0.01,
    prior_weight=None,
    cv=vec,
    vocab=list(vec.get_feature_names_out())
)

# Output the results
def fw_score(data, n=10):
    def print_top_terms(data):
        print(f'Top terms')
        for term, score in data:
            print(f'{term}\t{score}')

    print_top_terms(reversed(data[-n:]))
    print('\n')
    print_top_terms(data[:n])

fw_score(fighting_words_results)


Vocab size is 6043
Comparing language...
Top terms
abortion	0.53810458372724
said	0.4567316288827305
states	0.40137003019003803
women	0.4010128040413933
state	0.3912555396715251
people	0.37129953244701064
court	0.36738965079280955
health	0.36332280401687467
pregnancy	0.3606955093391477
law	0.35634462719211507


Top terms
joke	-0.16228342705665835
logistical	-0.16228342705665835
longterm	-0.16228342705665835
longtime	-0.16228342705665835
lookout	-0.16228342705665835
loom	-0.16228342705665835
looming	-0.16228342705665835
loosening	-0.16228342705665835
lord	-0.16228342705665835
losses	-0.16228342705665835
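
The scores above come from a single set of articles, so they appear mostly to rank words by how often they occur. The method in the Monroe et al. paper linked earlier is designed to contrast two groups of texts. Below is a rough standard-library sketch of that two-group comparison: a log-odds ratio with a uniform Dirichlet prior, reported as z-scores. It assumes whitespace tokenization and uses hypothetical sample sentences; `log_odds_z_scores` is an illustrative name and not part of `fightinwords`.

```python
import math
from collections import Counter

def log_odds_z_scores(docs_a, docs_b, prior=0.01):
    """Z-scored log-odds ratio of each word between two groups of documents.

    Positive scores lean toward docs_a, negative scores toward docs_b.
    A uniform Dirichlet prior (`prior` pseudo-counts per word) smooths rare words.
    Assumes the combined vocabulary has more than one word.
    """
    counts_a = Counter(w for d in docs_a for w in d.lower().split())
    counts_b = Counter(w for d in docs_b for w in d.lower().split())
    vocab = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    prior_total = prior * len(vocab)

    scores = {}
    for w in vocab:
        # Smoothed counts for this word in each group
        y_a, y_b = counts_a[w] + prior, counts_b[w] + prior
        log_odds_a = math.log(y_a / (n_a + prior_total - y_a))
        log_odds_b = math.log(y_b / (n_b + prior_total - y_b))
        # Approximate variance of the log-odds difference
        variance = 1.0 / y_a + 1.0 / y_b
        scores[w] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    # Sorted from most docs_b-leaning to most docs_a-leaning
    return sorted(scores.items(), key=lambda item: item[1])

# Hypothetical sample texts (illustration only)
docs_left = ["abortion rights are health care", "protect abortion rights"]
docs_right = ["protect unborn life", "unborn life is sacred"]

ranked = log_odds_z_scores(docs_left, docs_right)
print("Leans right:", ranked[:2])
print("Leans left:", ranked[-2:])
```

Words used equally by both groups (here, "protect") land near zero, which is the property that makes the two-group version useful for separating partisan vocabulary from topic vocabulary.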


## Data Scraping Presidential Debates

Let's scrape the raw text of the presidential and vice presidential debate segments that mention abortion. 


In [35]:
##install requests

!pip install requests



In [1]:
!pip install nltk


Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 7.7 MB/s eta 0:00:00
Installing collected packages: nltk
Successfully installed nltk-3.8.1


In [7]:
# Scraped data for the 2020 vice presidential debate (Kamala Harris and Mike Pence)

URL = "https://debates.org/voter-education/debate-transcripts/vice-presidential-debate-at-the-university-of-utah-in-salt-lake-city-utah/" 
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the indices of paragraphs containing the word "abortion"
abortion_indices = [index for index, p in enumerate(all_paragraphs) if 'abortion' in p.get_text().lower()]

if abortion_indices:
    # Extract paragraphs between the first and last mention of "abortion"
    start_index = abortion_indices[0]
    end_index = abortion_indices[-1]
    
    paragraphs_between_abortion = all_paragraphs[start_index:end_index + 1]

    # Create a DataFrame with columns for 'name', 'year', 'dem/rep', 'left/right', and 'content'
    data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}
    count_pence = 0

    for paragraph in paragraphs_between_abortion:
        count_pence += paragraph.get_text().count('PENCE:')
        
        if count_pence == 4:
            # Modify the paragraph in-place if it's the fourth instance of 'PENCE:'
            paragraph.string = re.sub(r'^PENCE:', 'HARRIS:', paragraph.get_text())
    
        # Skip paragraphs that start with "PAGE:"
        if not paragraph.get_text().startswith("PAGE:"):
            # Determine 'name' and 'dem/rep' based on the speaker
            if 'PENCE:' in paragraph.get_text():
                data['name'].append('PENCE')
                data['year'].append(2020)
                data['dem/rep'].append('rep')
                data['left/right'].append('right')
            else:
                data['name'].append('HARRIS')
                data['year'].append(2020)
                data['dem/rep'].append('dem')
                data['left/right'].append('left')
            
            # Strip the leading speaker tag; the group makes ^ apply to both names
            data['content'].append(re.sub(r'^(?:PENCE|HARRIS):', '', paragraph.get_text()))

    prez_df_2020 = pd.DataFrame(data)

    # Print the DataFrame
    print(prez_df_2020)
    print(prez_df_2020.shape)
        

     name  year dem/rep left/right  \
0   PENCE  2020     rep      right   
1   PENCE  2020     rep      right   
2   PENCE  2020     rep      right   
3  HARRIS  2020     dem       left   
4   PENCE  2020     rep      right   

                                             content  
0   Well thank you for the question, but I’ll use...  
1   My hope is that when the hearing takes place,...  
2   – treated respectfully and voted and confirme...  
3   Thank you, Susan. First of all, Joe Biden and...  
4   Well, thank you, Susan. Let me just say, addr...  
(5, 5)
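
The speaker-labelling logic in the cell above recurs for each debate, so it could be factored into one helper. The sketch below is one possible shape for it, not the notebook's method: `label_turns` and the sample paragraphs are hypothetical, and it assumes every turn begins with an upper-case tag such as `PENCE:`. Unlike the cell above, it skips every paragraph of a speaker who is not in `parties` (for example, a moderator), including untagged continuation paragraphs.

```python
import re

def label_turns(paragraphs, parties, year=None):
    """Attach speaker and party labels to transcript paragraphs.

    paragraphs: list of strings such as 'PENCE: First point.'
    parties:    dict mapping a speaker tag to a (dem/rep, left/right) pair
    Untagged paragraphs are treated as a continuation of the previous turn.
    """
    tag_pattern = re.compile(r'^([A-Z]+):\s*')
    rows, current = [], None
    for text in paragraphs:
        match = tag_pattern.match(text)
        if match:
            # A tag starts a new turn; tags not in `parties` map to None
            tag = match.group(1)
            current = tag if tag in parties else None
            text = text[match.end():]
        if current is None:
            continue  # moderator turn, or text before the first known speaker
        party, side = parties[current]
        rows.append({'name': current, 'year': year, 'dem/rep': party,
                     'left/right': side, 'content': text})
    return rows

# Hypothetical paragraphs (not quotations from the transcript)
paragraphs = [
    "PAGE: Question for the first candidate.",
    "PENCE: First point.",
    "Second point, same turn.",
    "HARRIS: A reply.",
    "PAGE: Thank you both.",
    "Moving to the next topic.",
]
parties = {'PENCE': ('rep', 'right'), 'HARRIS': ('dem', 'left')}

rows = label_turns(paragraphs, parties, year=2020)
for row in rows:
    print(row['name'], '|', row['content'])
```

The returned list of dictionaries can be passed directly to `pd.DataFrame(rows)` to get the same five columns used in this notebook.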


In [8]:
# Scraped data for the 2016 vice presidential debate (Mike Pence and Tim Kaine)

#October 4, 2016

URL = "https://www.debates.org/voter-education/debate-transcripts/october-4-2016-debate-transcript/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the index of the paragraph with the last mention of "abortion"
last_abortion_index = max([index for index, p in enumerate(all_paragraphs) if 'abortion' in p.get_text().lower()], default=0)

# Find the index of the paragraph starting with "PENCE: But for me"
pence_index = next((i for i, p in enumerate(all_paragraphs) if p.get_text().startswith("PENCE: But for me")), None)

# Extract paragraphs between "PENCE: But for me" and the last mention of "abortion"
paragraphs_selected = all_paragraphs[pence_index:last_abortion_index + 1]

# Create a DataFrame with columns for 'name', 'dem/rep', 'left/right', and 'content'
data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}

current_speaker = None

for paragraph in paragraphs_selected:
    text = paragraph.get_text()
    
    # Skip paragraphs that start with "QUIJANO:"
    if text.startswith("QUIJANO:"):
        continue
    
    # Determine 'name', 'dem/rep', and 'left/right' from the speaker tag;
    # untagged paragraphs continue the previous speaker's turn
    if 'PENCE:' in text:
        current_speaker, dem_rep, left_right = 'PENCE', 'rep', 'right'
    elif 'KAINE:' in text:
        current_speaker, dem_rep, left_right = 'KAINE', 'dem', 'left'
    
    # Append data to the dictionary
    data['name'].append(current_speaker)
    data['year'].append(2016)
    data['dem/rep'].append(dem_rep)
    data['left/right'].append(left_right)
    # Strip the leading speaker tag; the group makes ^ apply to both names
    data['content'].append(re.sub(r'^(?:PENCE|KAINE):', '', text))
    
# Create a DataFrame from the dictionary
prez_df_2016 = pd.DataFrame(data)

# Print the DataFrame
print(prez_df_2016)
print(prez_df_2016.shape)


     name  year dem/rep left/right  \
0   PENCE  2016     rep      right   
1   PENCE  2016     rep      right   
2   PENCE  2016     rep      right   
3   PENCE  2016     rep      right   
4   PENCE  2016     rep      right   
5   KAINE  2016     dem       left   
6   KAINE  2016     dem       left   
7   KAINE  2016     dem       left   
8   KAINE  2016     dem       left   
9   KAINE  2016     dem       left   
10  KAINE  2016     dem       left   
11  PENCE  2016     rep      right   
12  KAINE  2016     dem       left   
13  PENCE  2016     rep      right   
14  KAINE  2016     dem       left   
15  PENCE  2016     rep      right   
16  KAINE  2016     dem       left   
17  PENCE  2016     rep      right   
18  KAINE  2016     dem       left   
19  PENCE  2016     rep      right   
20  KAINE  2016     dem       left   
21  PENCE  2016     rep      right   
22  KAINE  2016     dem       left   
23  PENCE  2016     rep      right   
24  KAINE  2016     dem       left   
25  PENCE  2

In [9]:
# Scraped data for the third 2016 presidential debate (Hillary Clinton and Donald Trump)

#October 19, 2016

URL = "https://www.debates.org/voter-education/debate-transcripts/october-19-2016-debate-transcript/" 
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the indices of paragraphs containing the word "abortion"
abortion_indices = [index for index, p in enumerate(all_paragraphs) if 'abortion' in p.get_text().lower()]

if abortion_indices:
    # Extract paragraphs between the first and last mention of "abortion"
    start_index = abortion_indices[0]
    end_index = abortion_indices[-1]
    
    paragraphs_between_abortion = all_paragraphs[start_index:end_index + 1]

    # Create a DataFrame with columns for 'name', 'year', 'dem/rep', 'left/right', 'content'
    data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}

    current_speaker = None

    for paragraph in paragraphs_between_abortion:
        # Skip paragraphs that start with "WALLACE:"
        if not paragraph.get_text().startswith("WALLACE:"):
            # Determine 'name' and 'dem/rep' based on the speaker
            if 'TRUMP:' in paragraph.get_text():
                current_speaker = 'TRUMP'
            elif 'CLINTON:' in paragraph.get_text():
                current_speaker = 'CLINTON'
            
            data['name'].append(current_speaker)
            data['year'].append(2016)
            data['dem/rep'].append('rep' if current_speaker == 'TRUMP' else 'dem')
            data['left/right'].append('right' if current_speaker == 'TRUMP' else 'left')
            # Strip the leading speaker tag; the group makes ^ apply to both names
            data['content'].append(re.sub(r'^(?:TRUMP|CLINTON):', '', paragraph.get_text()))

    prez_df_2016_2 = pd.DataFrame(data)

    # Print the DataFrame
    print(prez_df_2016_2)
    print(prez_df_2016_2.shape)



       name  year dem/rep left/right  \
0     TRUMP  2016     rep      right   
1     TRUMP  2016     rep      right   
2     TRUMP  2016     rep      right   
3     TRUMP  2016     rep      right   
4   CLINTON  2016     dem       left   
5   CLINTON  2016     dem       left   
6   CLINTON  2016     dem       left   
7   CLINTON  2016     dem       left   
8   CLINTON  2016     dem       left   
9   CLINTON  2016     dem       left   
10    TRUMP  2016     rep      right   
11    TRUMP  2016     rep      right   
12  CLINTON  2016     dem       left   
13  CLINTON  2016     dem       left   

                                              content  
0                                              Right.  
1    Well, if that would happen, because I am pro-...  
2    If they overturned it, it will go back to the...  
3    Well, if we put another two or perhaps three ...  
4    Well, I strongly support Roe v. Wade, which g...  
5   So many states are putting very stringent regu...  
6   Don
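
Several of these cells select the stretch of transcript between the first and last paragraph that mentions "abortion". As a small illustration of that selection rule on plain strings, here is a sketch; `keyword_span` and the sample list are hypothetical and stand in for the text of the scraped `<p>` elements.

```python
def keyword_span(paragraphs, keyword):
    """Return the paragraphs from the first to the last one mentioning `keyword`."""
    needle = keyword.lower()
    hits = [i for i, text in enumerate(paragraphs) if needle in text.lower()]
    if not hits:
        return []  # keyword never appears
    return paragraphs[hits[0]:hits[-1] + 1]

# Hypothetical paragraphs (illustration only)
sample = [
    "Opening remarks on the economy.",
    "A question about Abortion policy.",
    "A reply that does not repeat the keyword.",
    "A closing point on abortion.",
    "A new topic begins.",
]

selected = keyword_span(sample, "abortion")
print(len(selected), "paragraphs selected")
```

The span keeps intervening replies that never use the keyword, which is what preserves whole exchanges. The same property means an isolated later mention would pull in everything in between, which may be why some cells here start from a specific paragraph instead.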

In [10]:
# Scraped data for the 2012 vice presidential debate (Joe Biden and Paul Ryan)

#October 11, 2012

URL = "https://www.debates.org/voter-education/debate-transcripts/october-11-2012-the-biden-romney-vice-presidential-debate/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the index of the paragraph containing "RYAN: I don’t see how a"
ryan_paragraph_index = next((index for index, p in enumerate(all_paragraphs) if 'RYAN: I don’t see how a' in p.get_text()), None)

# Find the indices of paragraphs containing the word "abortion" after the specified paragraph
abortion_indices = [index for index, p in enumerate(all_paragraphs[ryan_paragraph_index:]) if 'abortion' in p.get_text().lower()]

if abortion_indices:
    # Adjust the start index to include the paragraph containing "RYAN: I don’t see how a"
    start_index = ryan_paragraph_index
    end_index = ryan_paragraph_index + abortion_indices[-1]
    
    paragraphs_between_abortion = all_paragraphs[start_index:end_index + 1]

    # Create a DataFrame with columns for 'name', 'year', 'dem/rep', 'left/right', 'content'
    data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}

    current_speaker = None

    for paragraph in paragraphs_between_abortion:
        # Skip paragraphs that start with "RADDATZ:"
        if not paragraph.get_text().startswith("RADDATZ:"):
            # Determine 'name' and 'dem/rep' based on the speaker
            if 'RYAN:' in paragraph.get_text():
                current_speaker = 'RYAN'
            elif 'BIDEN:' in paragraph.get_text():
                current_speaker = 'BIDEN'
            
            data['name'].append(current_speaker)
            data['year'].append(2012)
            data['dem/rep'].append('rep' if current_speaker == 'RYAN' else 'dem')
            data['left/right'].append('right' if current_speaker == 'RYAN' else 'left')
            # Strip the leading speaker tag; the group makes ^ apply to both names
            data['content'].append(re.sub(r'^(?:RYAN|BIDEN):', '', paragraph.get_text()))

    prez_df_2012 = pd.DataFrame(data)

    # Print the DataFrame
    print(prez_df_2012)
    print(prez_df_2012.shape)

     name  year dem/rep left/right  \
0    RYAN  2012     rep      right   
1    RYAN  2012     rep      right   
2    RYAN  2012     rep      right   
3    RYAN  2012     rep      right   
4    RYAN  2012     rep      right   
5   BIDEN  2012     dem       left   
6   BIDEN  2012     dem       left   
7   BIDEN  2012     dem       left   
8    RYAN  2012     rep      right   
9    RYAN  2012     rep      right   
10  BIDEN  2012     dem       left   
11  BIDEN  2012     dem       left   
12   RYAN  2012     rep      right   
13   RYAN  2012     rep      right   
14  BIDEN  2012     dem       left   

                                              content  
0    I don’t see how a person can separate their p...  
1    Now, you want to ask basically why I’m pro-li...  
2   You know, I think about 10 1/2 years ago, my w...  
3   That’s why — those are the reasons why I’m pro...  
4   Our church should not have to sue our federal ...  
5    My religion defines who I am, and I’ve been a...  

In [11]:
# Scraped data for the third 2008 presidential debate (John McCain and Barack Obama)

#October 15, 2008

URL = "https://www.debates.org/voter-education/debate-transcripts/october-15-2008-debate-transcript/" 
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the indices of paragraphs containing the word "abortion"
abortion_indices = [index for index, p in enumerate(all_paragraphs) if 'abortion' in p.get_text().lower()]

if abortion_indices:
    # Extract paragraphs between the first and last mention of "abortion"
    start_index = abortion_indices[0]
    end_index = abortion_indices[-1]
    
    paragraphs_between_abortion = all_paragraphs[start_index:end_index + 1]

    # Create a DataFrame with columns for 'name', 'year', 'dem/rep', 'left/right', 'content'
    data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}

    current_speaker = None

    for paragraph in paragraphs_between_abortion:
        # Skip paragraphs that start with "SCHIEFFER:"
        if not paragraph.get_text().startswith("SCHIEFFER:"):
            # Determine 'name' and 'dem/rep' based on the speaker
            if 'MCCAIN:' in paragraph.get_text():
                current_speaker = 'MCCAIN'
            elif 'OBAMA:' in paragraph.get_text():
                current_speaker = 'OBAMA'
            
            data['name'].append(current_speaker)
            data['year'].append(2008)
            data['dem/rep'].append('rep' if current_speaker == 'MCCAIN' else 'dem')
            data['left/right'].append('right' if current_speaker == 'MCCAIN' else 'left')
            # Strip the leading speaker tag; the group makes ^ apply to both names
            data['content'].append(re.sub(r'^(?:MCCAIN|OBAMA):', '', paragraph.get_text()))

    prez_df_2008 = pd.DataFrame(data)

    # Print the DataFrame
    print(prez_df_2008)
    print(prez_df_2008.shape)

      name  year dem/rep left/right  \
0   MCCAIN  2008     rep      right   
1    OBAMA  2008     dem       left   
2    OBAMA  2008     dem       left   
3    OBAMA  2008     dem       left   
4    OBAMA  2008     dem       left   
5    OBAMA  2008     dem       left   
6    OBAMA  2008     dem       left   
7    OBAMA  2008     dem       left   
8    OBAMA  2008     dem       left   
9    OBAMA  2008     dem       left   
10  MCCAIN  2008     rep      right   
11  MCCAIN  2008     rep      right   
12  MCCAIN  2008     rep      right   
13  MCCAIN  2008     rep      right   
14  MCCAIN  2008     rep      right   
15  MCCAIN  2008     rep      right   
16  MCCAIN  2008     rep      right   
17   OBAMA  2008     dem       left   
18   OBAMA  2008     dem       left   
19   OBAMA  2008     dem       left   
20   OBAMA  2008     dem       left   
21   OBAMA  2008     dem       left   
22   OBAMA  2008     dem       left   
23   OBAMA  2008     dem       left   
24   OBAMA  2008     dem 

In [12]:
#Scraped Data for Presidential Election 2004 (Bush and Kerry)

#October 13, 2004

URL = "https://www.debates.org/voter-education/debate-transcripts/october-13-2004-debate-transcript/" 
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the index of the paragraph containing "KERRY: I respect their views."
kerry_paragraph_index = next((index for index, p in enumerate(all_paragraphs) if 'KERRY: I respect their views.' in p.get_text()), None)

# Find the indices of paragraphs containing the word "abortion" after the specified paragraph
abortion_indices = [index for index, p in enumerate(all_paragraphs[kerry_paragraph_index:]) if 'abortion' in p.get_text().lower()]

if abortion_indices:
    # Adjust the start index to include the paragraph containing "KERRY: I respect their views."
    start_index = kerry_paragraph_index
    end_index = kerry_paragraph_index + abortion_indices[-1]
    
    paragraphs_between_abortion = all_paragraphs[start_index:end_index + 1]

    # Create a DataFrame with columns for 'name', 'year', 'dem/rep', 'left/right', 'content'
    data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}

    current_speaker = None

    for paragraph in paragraphs_between_abortion:
        # Skip paragraphs that start with "SCHIEFFER"
        if not paragraph.get_text().startswith("SCHIEFFER"):
            # Determine 'name' and 'dem/rep' based on the speaker
            if 'BUSH:' in paragraph.get_text():
                current_speaker = 'BUSH'
            elif 'KERRY:' in paragraph.get_text():
                current_speaker = 'KERRY'
            
            data['name'].append(current_speaker)
            data['year'].append(2004)
            data['dem/rep'].append('rep' if current_speaker == 'BUSH' else 'dem')
            data['left/right'].append('right' if current_speaker == 'BUSH' else 'left')
            # Strip the leading speaker label (the anchor now applies to both names)
            data['content'].append(re.sub(r'^(?:BUSH|KERRY):', '', paragraph.get_text()))

    prez_df_2004 = pd.DataFrame(data)

    # Print the DataFrame
    print(prez_df_2004)
    print(prez_df_2004.shape)

     name  year dem/rep left/right  \
0   KERRY  2004     dem       left   
1   KERRY  2004     dem       left   
2   KERRY  2004     dem       left   
3   KERRY  2004     dem       left   
4   KERRY  2004     dem       left   
5   KERRY  2004     dem       left   
6   KERRY  2004     dem       left   
7   KERRY  2004     dem       left   
8   KERRY  2004     dem       left   
9   KERRY  2004     dem       left   
10  KERRY  2004     dem       left   
11  KERRY  2004     dem       left   
12  KERRY  2004     dem       left   
13   BUSH  2004     rep      right   
14   BUSH  2004     rep      right   
15   BUSH  2004     rep      right   
16   BUSH  2004     rep      right   

                                              content  
0    I respect their views. I completely respect t...  
1   I believe that I can’t legislate or transfer t...  
2   I believe that choice is a woman’s choice. It’...  
3   Now, I will not allow somebody to come in and ...  
4   The president has never said wh

In [13]:
#Scraped Data for Presidential Election 2004 (Bush and Kerry)

#October 8, 2004

URL = "https://www.debates.org/voter-education/debate-transcripts/october-8-2004-debate-transcript/" 
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the index of the paragraph containing "DEGENHART: Senator Kerry, suppose"
starting_index = next((index for index, p in enumerate(all_paragraphs) if 'DEGENHART: Senator Kerry, suppose' in p.get_text()), None)

# Find the indices of paragraphs containing the word "abortion" after the specified paragraph
abortion_indices = [index for index, p in enumerate(all_paragraphs[starting_index:]) if 'abortion' in p.get_text().lower()]

if abortion_indices:
    end_index = starting_index + abortion_indices[-1]
    
    paragraphs_between_abortion = all_paragraphs[starting_index:end_index + 1]

    # Create a DataFrame with columns for 'name', 'year', 'dem/rep', 'left/right', 'content'
    data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}

    current_speaker = None

    for paragraph in paragraphs_between_abortion:
        # Skip paragraphs that start with "DEGENHART:" or "GIBSON:"
        if not paragraph.get_text().startswith("DEGENHART:") and not paragraph.get_text().startswith("GIBSON:"):
            # Determine 'name' and 'dem/rep' based on the speaker
            if 'BUSH:' in paragraph.get_text():
                current_speaker = 'BUSH'
            elif 'KERRY:' in paragraph.get_text():
                current_speaker = 'KERRY'
            
            data['name'].append(current_speaker)
            data['year'].append(2004)
            data['dem/rep'].append('rep' if current_speaker == 'BUSH' else 'dem')
            data['left/right'].append('right' if current_speaker == 'BUSH' else 'left')
            # Strip the leading speaker label (the anchor now applies to both names)
            data['content'].append(re.sub(r'^(?:BUSH|KERRY):', '', paragraph.get_text()))

    prez_df_2004_2 = pd.DataFrame(data)

    # Print the DataFrame
    print(prez_df_2004_2)
    print(prez_df_2004_2.shape)

     name  year dem/rep left/right  \
0   KERRY  2004     dem       left   
1   KERRY  2004     dem       left   
2   KERRY  2004     dem       left   
3   KERRY  2004     dem       left   
4   KERRY  2004     dem       left   
5   KERRY  2004     dem       left   
6   KERRY  2004     dem       left   
7   KERRY  2004     dem       left   
8   KERRY  2004     dem       left   
9   KERRY  2004     dem       left   
10   BUSH  2004     rep      right   
11   BUSH  2004     rep      right   
12   BUSH  2004     rep      right   
13   BUSH  2004     rep      right   
14   BUSH  2004     rep      right   
15   BUSH  2004     rep      right   
16   BUSH  2004     rep      right   
17   BUSH  2004     rep      right   
18   BUSH  2004     rep      right   
19   BUSH  2004     rep      right   
20   BUSH  2004     rep      right   
21   BUSH  2004     rep      right   
22  KERRY  2004     dem       left   
23  KERRY  2004     dem       left   
24  KERRY  2004     dem       left   
25   BUSH  2

In [14]:
#Scraped Data for Presidential Election 2000 (Gore and Bush)

#October 3, 2000

URL = "https://www.debates.org/voter-education/debate-transcripts/october-3-2000-transcript/" 
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the indices of paragraphs containing the word "abortion"
abortion_indices = [index for index, p in enumerate(all_paragraphs) if 'abortion' in p.get_text().lower()]

if abortion_indices:
    # Extract paragraphs between the first and last mention of "abortion"
    start_index = abortion_indices[0]
    end_index = abortion_indices[-1]
    
    paragraphs_between_abortion = all_paragraphs[start_index:end_index + 1]

    # Create a DataFrame with columns for 'name', 'year', 'dem/rep', 'left/right', 'content'
    data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}

    current_speaker = None

    for paragraph in paragraphs_between_abortion:
        # Skip paragraphs that start with "MODERATOR:"
        if not paragraph.get_text().startswith("MODERATOR:"):
            # Determine 'name' and 'dem/rep' based on the speaker
            if 'BUSH:' in paragraph.get_text():
                current_speaker = 'BUSH'
            elif 'GORE:' in paragraph.get_text():
                current_speaker = 'GORE'
            
            data['name'].append(current_speaker)
            data['year'].append(2000)
            data['dem/rep'].append('rep' if current_speaker == 'BUSH' else 'dem')
            data['left/right'].append('right' if current_speaker == 'BUSH' else 'left')
            # Strip the leading speaker label (the anchor now applies to both names)
            data['content'].append(re.sub(r'^(?:BUSH|GORE):', '', paragraph.get_text()))

    prez_df_2000 = pd.DataFrame(data)

    # Print the DataFrame
    print(prez_df_2000)
    print(prez_df_2000.shape)

   name  year dem/rep left/right  \
0  BUSH  2000     rep      right   
1  GORE  2000     dem       left   

                                             content  
0   I don’t think a president can do that. I was ...  
1   Well, Jim, the FDA took 12 years, and I do su...  
(2, 5)
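
The scraping cells above all repeat the same speaker-tracking loop. Below is a minimal sketch of how that loop could be factored into one helper. The function name `rows_from_paragraphs`, its parameters, and the sample paragraphs are hypothetical and are not part of the notebook. It differs from the cells above in two ways: it only recognizes a speaker label at the start of a paragraph, and it drops paragraphs that appear before any candidate has spoken instead of labeling them `dem`.

```python
import re


def rows_from_paragraphs(texts, year, parties, skip_prefixes=()):
    """Turn transcript paragraphs into row dicts, carrying the speaker forward.

    texts: paragraph strings in transcript order.
    parties: maps a speaker label to (party, leaning), e.g. {"BUSH": ("rep", "right")}.
    skip_prefixes: tuple of labels (moderators, audience) whose paragraphs are dropped.
    """
    # Matches a known speaker label at the start of a paragraph, e.g. "BUSH: "
    label = re.compile(r"^\s*(" + "|".join(map(re.escape, parties)) + r"):\s*")
    rows, current = [], None
    for text in texts:
        if text.lstrip().startswith(tuple(skip_prefixes)):
            continue
        match = label.match(text)
        if match:
            current = match.group(1)
        if current is None:
            continue  # no candidate has spoken yet
        party, leaning = parties[current]
        rows.append({
            "name": current,
            "year": year,
            "dem/rep": party,
            "left/right": leaning,
            "content": label.sub("", text, count=1).strip(),
        })
    return rows


# Made-up paragraphs for illustration only (not transcript text)
sample = [
    "MODERATOR: New question.",
    "BUSH: First point.",
    "A follow-up paragraph with no label.",
    "GORE: A reply.",
]
rows = rows_from_paragraphs(
    sample,
    2000,
    {"BUSH": ("rep", "right"), "GORE": ("dem", "left")},
    skip_prefixes=("MODERATOR:",),
)
print([r["name"] for r in rows])
```

In the notebook, `texts` could be built with `[p.get_text() for p in paragraphs_between_abortion]`, and the result passed to `pd.DataFrame(rows)`.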


#### Merge Dataframes 

Make sure there are 140 rows and 5 columns. This will be the merged dataframe for all the debates.

In [15]:
prez_merged = pd.concat([prez_df_2020, prez_df_2016, prez_df_2016_2, prez_df_2012, prez_df_2008, prez_df_2004_2, prez_df_2004, prez_df_2000], ignore_index=True)
print(prez_merged)
print(prez_merged.shape)

       name  year dem/rep left/right  \
0     PENCE  2020     rep      right   
1     PENCE  2020     rep      right   
2     PENCE  2020     rep      right   
3    HARRIS  2020     dem       left   
4     PENCE  2020     rep      right   
..      ...   ...     ...        ...   
135    BUSH  2004     rep      right   
136    BUSH  2004     rep      right   
137    BUSH  2004     rep      right   
138    BUSH  2000     rep      right   
139    GORE  2000     dem       left   

                                               content  
0     Well thank you for the question, but I’ll use...  
1     My hope is that when the hearing takes place,...  
2     – treated respectfully and voted and confirme...  
3     Thank you, Susan. First of all, Joe Biden and...  
4     Well, thank you, Susan. Let me just say, addr...  
..                                                 ...  
135  Take, for example, the ban on partial birth ab...  
136  What I’m saying is, is that as we promote life...  
137  T

#### Analyze Top Words Used in Presidential Debate (Left)

In [44]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# Filter rows where 'dem/rep' is 'dem'
dem_df = prez_merged[prez_merged['dem/rep'] == 'dem']

# Combine all content into a single string
left_content = ' '.join(dem_df['content'])

#custom stopwords
custom_stopwords = ['president','think','people','wade','said','make','would','know','abortion']

# Create a CountVectorizer
vectorizer = CountVectorizer(stop_words=stopwords.words('english')+custom_stopwords)

# Fit and transform the content
word_matrix = vectorizer.fit_transform([left_content])

# Get feature names (words)
left_words = vectorizer.get_feature_names_out()

# Get word counts
word_counts = word_matrix.toarray()[0]

# Create a DataFrame with words and counts
word_df = pd.DataFrame({'word': left_words, 'count': word_counts})

# Sort DataFrame by count in descending order
word_df = word_df.sort_values(by='count', ascending=False)

# Print the top words
print(word_df.head(10))

         word  count
693     women     22
537     right     16
360      life     15
233     faith     15
543       roe     15
323     issue     12
158     court     12
692     woman     12
109  catholic     12
116    choice     11


#### Analyze Top Words Used in Presidential Debate (Right)

In [43]:
# Filter rows where 'dem/rep' is 'rep'
rep_df = prez_merged[prez_merged['dem/rep'] == 'rep']

# Combine all content into a single string
right_content = ' '.join(rep_df['content'])

#custom stopwords
custom_stopwords = ['abortion','know','think','would','people','senator','abortions']

# Create a CountVectorizer
vectorizer = CountVectorizer(stop_words=stopwords.words('english')+custom_stopwords)

# Fit and transform the content
word_matrix = vectorizer.fit_transform([right_content])

# Get feature names (words)
right_words = vectorizer.get_feature_names_out()

# Get word counts
word_counts = word_matrix.toarray()[0]

# Create a DataFrame with words and counts
word_df = pd.DataFrame({'word': right_words , 'count': word_counts})

# Sort DataFrame by count in descending order
word_df = word_df.sort_values(by='count', ascending=False)

# Print the top words
print(word_df.head(10))

        word  count
307     life     45
31   america     16
417      pro     15
133    court     12
581    voted     10
77     birth     10
422  promote     10
520  supreme      9
99     child      9
508   states      9
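
The left and right cells above filter with different custom stopword lists, so the two top-10 tables are not strictly comparable (for example, `abortions` is removed only on the right and `president` only on the left). Below is a minimal sketch of counting both sides with one shared list. The helper name `top_words` and the toy sentence are hypothetical. The token pattern is the one `CountVectorizer` uses by default, and the shared list is simply the union of the two lists above. In the notebook it would be combined with `stopwords.words('english')`.

```python
import re
from collections import Counter

# Same token pattern CountVectorizer uses by default: words of 2+ characters
TOKEN = re.compile(r"\b\w\w+\b")

# Union of the two custom stopword lists used in the cells above
shared_custom = {
    'president', 'think', 'people', 'wade', 'said', 'make',
    'would', 'know', 'abortion', 'senator', 'abortions',
}


def top_words(text, stop_words, n=10):
    """Return the n most frequent lowercase tokens that are not stopwords."""
    tokens = TOKEN.findall(text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


# Toy sentence for illustration only (not debate text)
toy_text = "Women have the right to choose. Women know the choice is theirs."
toy_stop = shared_custom | {"the", "have", "to", "is"}
print(top_words(toy_text, toy_stop, n=3))
```

Calling `top_words(left_content, stop)` and `top_words(right_content, stop)` with the same `stop` set would put both tables on the same footing.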


In [51]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

def overall_sentiment(text,analyzer=SentimentIntensityAnalyzer()):  
    
    sentiment_dict = analyzer.polarity_scores(text)
    comp = sentiment_dict['compound']
    
    #overall sentiment
    if comp >= 0.05: 
        sentiment = 'positive'
    elif comp <= -0.05: 
        sentiment = 'negative'
    else: 
        sentiment = 'neutral'
    sentiment_dict['Overall sentiment'] = sentiment 
    return sentiment_dict 

print("Overall Sentiment for Left:")
print(overall_sentiment(left_content))

print("\nOverall Sentiment for Right:")
print(overall_sentiment(right_content))

Overall Sentiment for Left:
{'neg': 0.077, 'neu': 0.782, 'pos': 0.141, 'compound': 0.9997, 'Overall sentiment': 'positive'}

Overall Sentiment for Right:
{'neg': 0.066, 'neu': 0.743, 'pos': 0.192, 'compound': 0.9999, 'Overall sentiment': 'positive'}
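
One caution when reading the compound scores above: as far as we understand VADER's reference implementation, `compound` is the summed word valence squashed by `x / sqrt(x^2 + 15)`, so any long text with even a small positive balance lands near 1.0. Below is a minimal sketch of that normalization and of averaging per-paragraph scores instead. The names `vader_normalize` and `mean_compound` are hypothetical helpers, and the valence totals are made-up inputs.

```python
import math


def vader_normalize(valence_sum, alpha=15):
    """Squash an unbounded valence sum into (-1, 1), as VADER does for 'compound'."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


def mean_compound(texts, score_fn):
    """Average a per-text score so that text length does not saturate the result."""
    scores = [score_fn(t) for t in texts]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up valence totals: the longer the text, the larger the total tends to be
for total in (1, 4, 40, 400):
    print(total, round(vader_normalize(total), 4))
```

With the analyzer from the cell above, `score_fn` could be `lambda t: analyzer.polarity_scores(t)['compound']`, applied to `dem_df['content']` and `rep_df['content']` separately.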


#### Analysis of **VADER**

Both the left and right show highly positive VADER sentiment analysis scores. This makes sense considering the context of 

## 3. Methods

### DEADLINE: TUE 12/5 (Midnight)

- Use Fighting Words to examine partisan differences in word choice across news outlets (Lecture 10; use the given library, as in PSET 3).
- Use VADER on the presidential debates and congressional speech phrasings (VADER was introduced in PSET 1).
- Use k-means clustering of phrasings, applied to all (PSET 2; Lectures 6 and 7).
- Use BERT classification (PSET 5; Lectures 15 and 16; article discussed).
- For CENTER outlets, use VADER to examine the connotation of the words each outlet uses, then compare against the Fighting Words identified for left and right and examine how those words are distributed within the outlet.
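
As a reminder of what the Fighting Words step computes, below is a minimal sketch of the log-odds ratio with a uniform Dirichlet prior, reported as z-scores. This is not the course library. The function name `fighting_words`, the `prior` value of 0.01, and the toy token lists are assumptions for illustration.

```python
import math
from collections import Counter


def fighting_words(tokens_a, tokens_b, prior=0.01):
    """Rank words by z-scored log-odds ratio; positive scores lean toward group A."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    if len(vocab) < 2:
        raise ValueError("need at least two distinct words to compare")
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    a0 = prior * len(vocab)  # total prior mass under a uniform prior
    z = {}
    for word in vocab:
        ya = counts_a[word] + prior
        yb = counts_b[word] + prior
        log_odds_a = math.log(ya / (n_a + a0 - ya))
        log_odds_b = math.log(yb / (n_b + a0 - yb))
        variance = 1.0 / ya + 1.0 / yb  # approximate variance of the difference
        z[word] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return sorted(z.items(), key=lambda kv: kv[1], reverse=True)


# Toy token lists for illustration only (not real counts)
left_toy = "women choice right women faith".split()
right_toy = "life life pro life court".split()
ranked = fighting_words(left_toy, right_toy)
print(ranked[0][0], ranked[-1][0])
```

The top of the ranking holds the words most distinctive of the first group and the bottom holds those most distinctive of the second, which is the comparison the CENTER analysis would reuse.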

## 4. Results

### DEADLINE: THURS 12/7 (Midnight)

## 5. Discussion and conclusions

### DEADLINE: 12/8