# Final project guidelines

**Note:** Use these guidelines if and only if you are pursuing a **final project of your own design**. For those taking the final exam instead of the project, see the (separate) final exam notebook.

## Guidelines

These guidelines are intended for **undergraduates enrolled in INFO 3350**. If you are a graduate student enrolled in INFO 6350, you're welcome to consult the information below, but you have wider latitude to design and develop your project in line with your research goals.

### The task

Your task is to: identify an interesting problem connected to the humanities or humanistic social sciences that's addressable with the help of computational methods, formulate a hypothesis about it, devise an experiment or experiments to test your hypothesis, present the results of your investigations, and discuss your findings.

These tasks essentially replicate the process of writing an academic paper. You can think of your project as a paper in miniature.

You are free to present each of these tasks as you see fit. You should use narrative text (that is, your own writing in a markdown cell), citations of others' work, numerical results, tables of data, and static and/or interactive visualizations as appropriate. Total length is flexible and depends on the number of people involved in the work, as well as the specific balance you strike between the ambition of your question and the sophistication of your methods. But be aware that numbers never, ever speak for themselves. Quantitative results presented without substantial discussion will not earn high marks. 

Your project should reflect, at minimum, ten **or more** hours of work by each participant, though you will be graded on the quality of your work, not the amount of time it took you to produce it. Most high-quality projects represent twenty or more hours of work by each member.

#### Pick an important and interesting problem!

No amount of technical sophistication will overcome a fundamentally uninteresting problem at the core of your work. You have seen many pieces of successful computational humanities research over the course of the semester. You might use these as a guide to the kinds of problems that interest scholars in a range of humanities disciplines. You may also want to spend some time in the library, reading recent books and articles in the professional literature. **Problem selection and motivation are integral parts of the project.** Do not neglect them.

### Format

You should submit your project as a Jupyter notebook, along with all data necessary to reproduce your analysis. If your dataset is too large to share easily, let us know in advance so that we can find a workaround. If you have a reason to prefer a presentation format other than a notebook, likewise let us know so that we can discuss the options.

Your report should have four basic sections (provided in cells below for ease of reference):

1. **Introduction and hypothesis.** What problem are you working on? Why is it interesting and important? What have other people said about it? What do you expect to find?
2. **Corpus, data, and methods.** What data have you used? Where did it come from? How did you collect it? What are its limitations or omissions? What major methods will you use to analyze it? Why are those methods the appropriate ones?
3. **Results.** What did you find? How did you find it? How should we read your figures? Be sure to include confidence intervals or other measures of statistical significance or uncertainty where appropriate.
4. **Discussion and conclusions.** What does it all mean? Do your results support your hypothesis? Why or why not? What are the limitations of your study and how might those limitations be addressed in future work?

Within each of those sections, you may use as many code and markdown cells as you like. You may, of course, address additional questions or issues not listed above.

All code used in the project should be present in the notebook (except for widely-available libraries that you import), but **be sure that we can read and understand your report in full without rerunning the code**. Be sure, too, to explain what you're doing along the way, both by describing your data and methods and by writing clean, well commented code.

### Grading

This project takes the place of the take-home final exam for the course. It is worth 35% of your overall grade. You will be graded on the quality and ambition of each aspect of the project. No single component is more important than the others.

### Practical details

* The project is due at **noon on Saturday, December 9** via upload to CMS of a single zip file containing your fully executed Jupyter notebook and all associated data.
* You may work alone or in a group of up to three total members.
    * If you work in a group, be sure to list the names of the group members.
    * For groups, create your group on CMS and submit one notebook for the entire group. **Each group should also submit a statement of responsibility** that describes in general terms who performed which parts of the project.
* You may post questions on Ed, but should do so privately (visible to course staff only).
* Interactive visualizations do not always work when embedded in shared notebooks. If you plan to use interactives, you may need to host them elsewhere and link to them.

---

## Your info
* NetID(s): sc2548, jsc342
* Name(s): Stephy Chen, Joyce Chen
---

## Brainstorm Ideas for Project ##

1. Examining lyrics of songs from varying genres and artists (possibly taken from Billboard, Top 100 charts, etc.) to see whether they reflect popular trends of their period, based on the current events occurring at the time (which may include politically or socially motivated events)

2. Examining how authors (perhaps classical authors from a certain period or genre) used adjectives and other connotative words or phrases to describe gender. (Could we predict information about the author from these phrases?)

3. Examining song lyrics from varying genres/artists to determine how they describe gender. Seeing whether the language used or connotation of their words reflect a trend in the type of artists that sang or wrote the lyrics.

4. Examining media and news outlets' diction in describing current events relating to political partisanship (whether their words indicate the general party they're leaning towards)
   - Looking at whether the words used to describe certain events are comparatively more positive or negative in connotation
   - Seeing whether there is a distinction between the types of words used to describe current events by the left and right wings
   - Implications: 
     - Being aware of echo chambers 
     - Recommending news outlets/media that are more neutral
     - Indicating which news outlets/media are more biased towards a side
     - Is there a correlation with the recent political polarization?
https://www.mediacloud.org/media-cloud-directory 

5. Examine reddit and discussion forums to understand incel culture/crimes against women descriptions to see 


FINAL DECISION: CHOICE 4 

---


## 1. Introduction

In an era marked by unprecedented political polarization, the focus on media and its role in shaping public perception has never been more critical. This data project delves into the intricate web of language, word choice, and stylistic choices employed by various news outlets when reporting on current events, with a keen emphasis on political partisanship.

The importance of this investigation lies in its potential to uncover implicit biases within media narratives. First, we scrutinize the diction and connotations that news outlets use to describe events, looking for patterns in word choice or coverage that indicate a leaning towards a particular political party. As part of this, we compare the language used by outlets across the political spectrum when covering the same events, to identify the distinct narratives crafted by left-wing and right-wing sources and evaluate the implications of these differing perspectives. Second, we go beyond observation to examine whether certain events are portrayed with a comparatively positive or negative bias, contributing to the broader discourse on media objectivity.

The significance of this research is far-reaching and transcends academic curiosity. It holds profound implications for societal awareness and media literacy. By making citizens aware of potential echo chambers in media consumption, the project aims to empower individuals to critically evaluate the information they receive. Beyond mere awareness, our project aspires to offer recommendations for news outlets and media that demonstrate a more neutral stance, allowing consumers to make informed choices about their news sources.

Existing research has already shed light on the political slant present in many media sources (the AllSides chart below is widely used to classify media outlets). This project builds upon this foundation, seeking not only to confirm these biases but also to provide a nuanced understanding of the language that perpetuates them. As political polarization continues to shape the socio-political landscape, this research contributes to the ongoing dialogue by exploring the correlation between media language and the evolving dynamics of political partisanship.

![image](pictures/all-sides.jpeg)

https://www.allsides.com/media-bias/media-bias-chart 

https://www.pewresearch.org/journalism/2014/10/21/section-1-media-sources-distinct-favorites-emerge-on-the-left-and-right/


## 2. Research Question and Hypothesis

### Hypothesis

We hypothesize that there is a correlation between the language and style employed by news outlets in reporting current events and political partisanship. 

(if specificity is needed, here is a version: We hypothesize that there is a positive correlation between the language and style employed by news outlets in reporting current events and political partisanship. Specifically, we anticipate that news outlets leaning towards a particular political party will use language and style that align with the ideologies of that party.)

It would be better for us to focus on a specific topic: 
a single issue (a couple of hundred articles) covered by two outlets. 

### Research Question 

Do news outlets' language and style in reporting current events correlate with political partisanship? How much do these language choices link to the perceived positivity or negativity of specific events?

### Selected News Outlets 

Out of all the outlets shown in the AllSides chart, we will choose 3 from each category and analyze their published articles. (Should we have a separate section where we specifically pick articles from different sources that are covering the same events, or is that included within the 3 that we choose from each category?)

#### Left 
- CNN 
- Buzzfeed News 
- HuffPost 
- MSNBC 
- The New Yorker 

#### Left-Leaning 
- Bloomberg 
- CBS 
- NBC 
- New York Times News
- Washington Post 
- USA Today 

#### Center
- Wall Street Journal News 
- Reuters 
- Newsweek 
- BBC 
- The Hill 

#### Right-Leaning 
- Epoch Times 
- The Washington Times 
- The Post Millennial 
- The American Conservative 
- The Dispatch 

#### Right
- Daily Mail 
- Daily Wire 
- Fox News 
- The Federalist 
- The American Spectator



OFFICIAL LIST

#### Left 
- CNN  (FINISHED)

#### Left-Leaning 
- New York Times News (FINISHED)

#### Center
- Wall Street Journal News (FINISHED)

#### Right-Leaning
- New York Post (FINISHED)

#### Right 
- Fox News (FINISHED)




## 2. Corpus & Data Cleaning

### Corpus Creation: DEADLINE (12/05)

1. Scrape a large number of articles from news outlets

2. Read relevant papers (on abortion and the work that has been done to analyze thus far) 
http://languagelog.ldc.upenn.edu/myl/Monroe.pdf
https://www.pewresearch.org/religion/fact-sheet/public-opinion-on-abortion/#CHAPTER-h-views-on-abortion-2021-a-detailed-look 

3. Standard of Comparison Creation Part I: Examine the partisanship of congressional speeches to create a spectrum of words that indicates whether certain phrasing belongs to a certain party
https://data.stanford.edu/congress_text (scroll down on page to find multiple zip files)

4. Standard of Comparison Creation Part II: Examine presidential debates regarding abortion and use them as a standard to characterize political stances on abortion (whether the speakers express positive or negative views) 
https://www.debates.org/voter-education/debate-transcripts/   

- Finding similarity of the phrasings used between the standards created (backed up by scholarly articles) with the phrasings commonly found within news articles
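
One way the "spectrum of words" in step 3 could be built is with the weighted log-odds ratio described in the Monroe et al. paper linked above. The sketch below is a minimal illustration under stated assumptions, not the project's final implementation: the function name `log_odds_z_scores`, the uniform prior of 0.01 per word, and the two toy token lists are all hypothetical.

```python
import math
from collections import Counter

def log_odds_z_scores(tokens_a, tokens_b, prior=0.01):
    """Return {word: z-score}; positive scores lean towards corpus A,
    negative scores towards corpus B."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    # Total prior mass; a uniform prior per word is an assumption of this sketch
    alpha_0 = prior * len(vocab)

    scores = {}
    for word in vocab:
        y_a = counts_a[word] + prior  # smoothed count in corpus A
        y_b = counts_b[word] + prior  # smoothed count in corpus B
        log_odds_a = math.log(y_a / (n_a + alpha_0 - y_a))
        log_odds_b = math.log(y_b / (n_b + alpha_0 - y_b))
        variance = 1.0 / y_a + 1.0 / y_b  # approximate variance of the difference
        scores[word] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return scores

# Toy, made-up token lists standing in for two sets of speeches
speeches_a = "protect access to care and protect rights".split()
speeches_b = "protect life and defend the unborn life".split()

scores = log_odds_z_scores(speeches_a, speeches_b)
ranked = sorted(scores, key=scores.get)
print("Most typical of B:", ranked[:2])
print("Most typical of A:", ranked[-2:])
```

Words that appear equally often in both corpora score close to zero, so only phrasing that separates the two groups rises to either end of the ranking.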

  

## Data Scraping News Outlet Articles

In [35]:
import pandas as pd

def read_txt_to_dataframe(file_path):
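    """Parse a scraped text file of 'Key: value' lines into a DataFrame.

    Each article is a run of lines such as 'News_Outlet: ...', 'Title: ...',
    'Author: ...', 'Publication_Date: ...'; an 'Article_Content: ...' line
    closes the record.
    """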

    # Open the text file
    with open(file_path, 'r') as file:
        lines = file.readlines()

        # Initialize variables to store information
        current_article = {}
        
        # Initialize list to store dictionaries
        articles_data = []
        
        # Iterate through lines in the file
        for line in lines:
            # Split the line into key and value if possible
            line_parts = line.split(':', 1)
            
            # Check if the line can be split into key and value
            if len(line_parts) == 2:
                key, value = map(str.strip, line_parts)
                
                # Check for the end of an article
                if key == 'Article_Content':
                    # Save the current article information
                    current_article['Article_Content'] = value.strip()
                    
                    articles_data.append({
                        'News_Outlet': current_article.get('News_Outlet', ''),
                        'Title': current_article.get('Title', ''),
                        'Author': current_article.get('Author', ''),
                        'Publication_Date': current_article.get('Publication_Date', ''),
                        'Article_Content': current_article.get('Article_Content', '')
                    })

                    current_article = {}
                else:
                    # Add key-value pair to current article dictionary
                    current_article[key] = value

    df = pd.DataFrame(articles_data)

    return df
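
The parser above relies on a specific text layout: one `Key: value` pair per line, with `Article_Content` as the last field of each article. The sketch below illustrates that assumed layout on an in-memory string. The helper `parse_records` and the sample text are hypothetical (the sample is not taken from the real corpus), and unlike `read_txt_to_dataframe` it returns plain dictionaries and does not fill missing fields with empty strings.

```python
def parse_records(text):
    """Parse 'Key: value' lines into a list of article dicts.

    A record is closed when an 'Article_Content' line is seen,
    mirroring the logic of read_txt_to_dataframe.
    """
    records, current = [], {}
    for line in text.splitlines():
        # Split on the first colon only, so colons inside values survive
        key, sep, value = line.partition(':')
        if not sep:
            continue  # lines without a colon are skipped
        key = key.strip()
        current[key] = value.strip()
        if key == 'Article_Content':
            records.append(current)
            current = {}
    return records

# Hypothetical sample in the assumed layout
sample = """News_Outlet: CNN
Title: Example headline: with a colon
Author: Jane Doe
Publication_Date: Published 7:29 AM EDT, Fri June 23, 2023
Article_Content: First sentence of the article body.
News_Outlet: FOX
Title: Second example
Article_Content: Another body.
"""

records = parse_records(sample)
print(len(records))
print(records[0]['Title'])
```

Because only the first colon on a line is treated as the separator, titles and timestamps that contain colons are kept whole. The same logic means the article body has to sit on a single line: a wrapped body line without a colon is dropped, and one with a colon is read as a new key.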


In [122]:
cnn_data_frame = read_txt_to_dataframe('CNN.txt')

cnn_data_frame = cnn_data_frame.reset_index()
cnn_data_frame = cnn_data_frame.rename(columns={'index': 'docid'})

cnn_data_frame.head(10)


Unnamed: 0,docid,News_Outlet,Title,Author,Publication_Date,Article_Content
0,0,CNN,"Abortion is ancient history: Long before Roe, ...",Katie Hunt,"Published 7:29 AM EDT, Fri June 23, 2023","CNN — Abortion today, at least in the United ..."
1,1,CNN,Another state passed a near-total abortion ban...,Zachary B. Wolf,"Updated 2:13 AM EDT, Wed September 14, 2022",CNN — Good luck trying to keep on top of the ...
2,2,CNN,Myths about abortion and women’s mental health...,Sandee LaMotte,"Updated 11:12 PM EDT, Mon July 11, 2022",CNN — It’s an unfounded message experts say i...
3,3,CNN,Survey finds widespread confusion around medic...,Deidre McPhillips,"Published 6:13 AM EST, Wed February 1, 2023",CNN — Nearly half of adults in the United Sta...
4,4,CNN,Births have increased in states with abortion ...,Deidre McPhillips,"Published 1:22 PM EST, Tue November 21, 2023",CNN — Nearly a quarter of people seeking an ...
5,5,CNN,How outlawing abortion could worsen America’s ...,Priya Krishnakumar and Daniel Wolfe,"Updated 11:58 AM EDT, Fri June 24, 2022",CNN — Dr. Judette Louis recalls a time when s...
6,6,CNN,Abortion laws impact people trying to become p...,Madeline Holcombe,"Published 9:50 AM EDT, Sun July 3, 2022",CNN — All Sharon McRae and her husband wanted...
7,7,CNN,Maternal and infant death rates are higher in ...,Jacqueline Howard,"Updated 8:29 AM EST, Fri December 16, 2022",CNN — The rates of mothers and newborn babie...
8,8,CNN,"How a medication abortion, also known as an ‘a...",Sandee LaMotte,"Updated 4:57 AM EDT, Thu May 11, 2023",CNN — While legal battles over access to mife...
9,9,CNN,"Because of Florida abortion laws, she carried ...","Elizabeth Cohen, Carma Hassan and Amanda Musa","Updated 10:32 AM EDT, Wed May 3, 2023","CNN — A Florida woman, unable to get an abor..."


In [121]:
nyt_data_frame = read_txt_to_dataframe('NYT.txt')

nyt_data_frame = nyt_data_frame.reset_index()
nyt_data_frame = nyt_data_frame.rename(columns={'index': 'docid'})

nyt_data_frame.head(10)

Unnamed: 0,docid,News_Outlet,Title,Author,Publication_Date,Article_Content
0,0,NYT,Could Abortion Rights Rescue Red-State Democra...,Michael C. Bender and Anjali Huynh,"Nov. 29, 2023",Senator Sherrod Brown of Ohio is seen as one o...
1,1,NYT,Texas Woman Asks Court to Allow Her Abortion,J. David Goodman,"Dec. 5, 2023","A woman who is 20 weeks pregnant, and whose fe..."
2,2,NYT,Newsom-DeSantis Debate DeSantis and Newsom Slu...,"Shane Goldmacher, Jonathan Weisman, Nicholas N...","Published Nov. 30, 2023, Updated Dec. 6, 2023",Ron DeSantis was more aggressive than he had b...
3,3,NYT,"Talk About Abortion, Don’t Talk About Trump: G...",Reid J. Epstein,"Dec. 4, 2023","At an annual gathering in Arizona, Democratic ..."
4,4,NYT,Tuberville Drops Blockade of Most Military Pro...,Catie Edmondson,"Dec. 5, 2023","Under pressure from senators in both parties, ..."
5,5,NYT,Justice O’Connor’s Judicial Legacy Was Undermi...,Adam Liptak,"Dec. 1, 2023","Since her retirement in 2006, the court has di..."
6,6,NYT,Ohio Vote Continues a Winning Streak for Abort...,Kate Zernike,"Nov. 7, 2023",The State Constitution will protect access to ...
7,7,NYT,Florida Republicans Propose 6-Week Abortion Ban,David W. Chen and Patricia Mazzei,"March 7, 2023",The bills would tighten the current 15-week li...
8,8,NYT,Ohio Is Voting on Whether to Establish Abortio...,Kate Zernike,"Nov. 7, 2023",The outcome is being closely watched by Democr...
9,9,NYT,Abortion Bans Fail in South Carolina and Nebraska,Adeel Hassan and Eliza Fawcett,"Published April 27, 2023 Updated May 1, 2023","The News South Carolina and Nebraska, two cons..."


In [120]:
fox_data_frame = read_txt_to_dataframe('FOX.txt')

fox_data_frame = fox_data_frame.reset_index()
fox_data_frame = fox_data_frame.rename(columns={'index': 'docid'})

fox_data_frame.head(10)

Unnamed: 0,docid,News_Outlet,Title,Author,Publication_Date,Article_Content
0,0,FOX,Cosmopolitan magazine shares steps for how to ...,Alexander Hall,"Published December 3, 2023 9:00am EST",The ritual as described is concluded by declar...
1,1,FOX,Tuberville ends blockade of most military prom...,Chris Pandolfo,"Published December 5, 2023 2:11pm EST Updated ...","Sen. Tommy Tuberville, R-Ala., cleared the way..."
2,2,FOX,Haley calls for 'consensus' on issue of aborti...,Brooke Singman,"Published August 23, 2023 10:19pm EDT",GOP candidates were asked if they would suppor...
3,3,FOX,"Ron DeSantis signs six-week abortion law, deli...",Gabriel Hays,"Published April 14, 2023 11:30am EDT",Florida pro-lifers were overjoyed at the gover...
4,4,FOX,Supreme Court overturns Roe v. Wade in landmar...,"Ronn Blitzer , Kelly Laco","Published June 24, 2022 10:11am EDT Updated Ju...",Supreme Court decision overturning Roe v. Wade...
5,5,FOX,Fox News Poll: Two-thirds say abortion pill sh...,Victoria Balara,"Published April 27, 2023 6:00pm EDT",Fox News Polling found that 30% of voters beli...
6,6,FOX,Federal appeals court restricts access to abor...,"Adam Sabes , David Spunt","Published August 16, 2023 6:18pm EDT",The Supreme Court previously issued an injunct...
7,7,FOX,Pro-life Republicans say FDA approval of abort...,Chris Pandolfo,"Published April 12, 2023 7:40am EDT",Republican lawmakers urging Fifth Circuit Cour...
8,8,FOX,DeSantis warns the left will 'weaponize' Trump...,Danielle Wallace,"Published September 24, 2023 7:22am EDT",lorida Gov. Ron DeSantis says left will 'weapo...
9,9,FOX,Newsom refuses to say if he supports any limit...,Joseph A. Wulfsohn,"Published September 19, 2023 8:30pm EDT",CNN's Dana Bash repeatedly pressed the Califor...


In [119]:
wsj_data_frame = read_txt_to_dataframe('WSJ.txt')

wsj_data_frame = wsj_data_frame.reset_index()
wsj_data_frame = wsj_data_frame.rename(columns={'index': 'docid'})

wsj_data_frame.head(10)

Unnamed: 0,docid,News_Outlet,Title,Author,Publication_Date,Article_Content
0,0,WSJ,Tommy Tuberville Backs Down in Fight Over Mili...,Lindsay Wise and Nancy A. Youssef,"Updated Dec. 5, 2023 5:42 pm ET",Alabama Sen. Tommy Tuberville lost the support...
1,1,WSJ,The 2024 Election Rematch That Americans Dread...,"Ken Thomas, Catherine Lucey, and John McCormick","Updated Nov. 5, 2023 11:38 am ET","A year before the 2024 election, a divided nat..."
2,2,WSJ,"Support for Abortion Access Is Near Record, Po...",Julie Wernau,"Nov. 20, 2023 9:00 am ET",Some 55% of respondents say it should be possi...
3,3,WSJ,How Democrats Are Leveraging Abortion Rights t...,Laura Kusisto and Jimmy Vielkind,"Nov. 10, 2023 9:00 pm ET",Ohio’s passage of a measure to protect abortio...
4,4,WSJ,"Abortion-Rights Supporters Rack Up Victories, ...",Aaron Zitner and Laura Kusisto,"Updated Nov. 8, 2023 10:43 pm ET",From red Ohio and Kentucky to purple Virginia ...
5,5,WSJ,Tuberville’s One-Man Stand Strains Senate Pati...,Molly Ball,"Nov. 13, 2023 5:00 am ET",Sen. Tommy Tuberville’s crusade has earned him...
6,6,WSJ,Democrats Grow More Confident in Campaign Mess...,"Annie Linskey, Ken Thomas and Katy Stech Ferek","November 9, 2023 04:15 pm ET",Some in the party worry that the president wil...
7,7,WSJ,Republican Senators Confront Tommy Tuberville ...,Lindsay Wise and Nancy A. Youssef,"November 1, 2023 11:08 pm ET",The Alabama senator has staged a months long p...
8,8,WSJ,Tuberville Pushes to Confirm Marines’ No. 2,"Katy Stech Ferek, Lindsay Wise and Gordon Lubold","October 31, 2023 08:34 pm ET",The Alabama senator is pressing to quickly con...
9,9,WSJ,"On Debate Stage and Campaign Trail, Republican...",Molly Ball,"Nov. 9, 2023 5:00 am ET","For a change, the central problem facing the c..."


In [52]:
nyp_data_frame = read_txt_to_dataframe('NYP.txt')

nyp_data_frame = nyp_data_frame.reset_index()
nyp_data_frame = nyp_data_frame.rename(columns={'index': 'docid'})

nyp_data_frame.head(10)

Unnamed: 0,docid,News_Outlet,Title,Author,Publication_Date,Article_Content
0,0,NYP,California's Newsom slams Walgreens on abortio...,"Eric Revell, Fox Business","March 7, 2023 | 3:50am",California Governor Gavin Newsom (D) slammed W...
1,1,NYP,Arrest made in fire at planned Wyoming abortio...,Associated Press,"Published March 22, 2023, 10:10 p.m. ET","Lorna Roxanne Green, 22, was arrested on charg..."
2,2,NYP,Texas activists push for ‘trafficking’ laws to...,Isabel Vincent,"Published Sep. 2, 2023, 12:53 p.m. ET",Conservative activists in Texas have developed...
3,3,NYP,Idaho governor signs ‘abortion trafficking’ bi...,Associated Press,"Published April 6, 2023, 5:27 a.m. ET",Idaho Gov. Brad Little signed a bill into law ...
4,4,NYP,Federal appeals court backs limits on abortion...,Victor Nava,"Published Aug. 16, 2023 Updated Aug. 16, 2023,...",A federal appeals court on Wednesday ruled in ...
5,5,NYP,Supreme Court allows abortion pill to remain a...,Samuel Chamberlain,"Published April 21, 2023 Updated April 21, 202...",The abortion drug mifepristone will remain ava...
6,6,NYP,AOC channels Andrew Jackson on abortion pill: ...,Mary Kay Linge,"Published April 8, 2023 Updated April 8, 2023,...",Lefty darling Rep. Alexandria Ocasio-Cortez wa...
7,7,NYP,Wesleyan University to cover abortion costs fo...,Katherine Donlevy,"Published May 11, 2023 Updated May 11, 2023, 6...","Starting this fall, Wesleyan University will c..."
8,8,NYP,Texas judge tosses first lawsuit filed under a...,Matthew Sedacca,"Published Dec. 10, 2022, 1:06 p.m. ET",A Texas judge tossed out the first lawsuit fil...
9,9,NYP,House GOP defense budget targets diversity tra...,Caitlin Doornbos,"Published June 15, 2023, 7:03 p.m. ET",WASHINGTON – The House Appropriations Committe...


In [54]:
import pandas as pd

outlet_list = [cnn_data_frame, fox_data_frame, wsj_data_frame, nyt_data_frame, nyp_data_frame]

# Check each outlet for duplicated titles and duplicated article bodies
for outlet in outlet_list:
    repeating_title = outlet['Title'].duplicated()
    repeating_content = outlet['Article_Content'].duplicated()
    # Print the rows whose title or content repeats an earlier row
    print(outlet[repeating_title])
    print(outlet[repeating_content])
    # Report whether any duplicates were found in each column
    if repeating_title.any():
        print("There are repeating strings in the 'Title' column.")
    else:
        print("No repeating strings in the 'Title' column.")
    if repeating_content.any():
        print("There are repeating strings in the 'Article_Content' column.")
    else:
        print("No repeating strings in the 'Article_Content' column.")




Empty DataFrame
Columns: [docid, News_Outlet, Title, Author, Publication_Date, Article_Content]
Index: []
Empty DataFrame
Columns: [docid, News_Outlet, Title, Author, Publication_Date, Article_Content]
Index: []
No repeating strings in the 'Title' column.
No repeating strings in the 'Article_Content' column.
Empty DataFrame
Columns: [docid, News_Outlet, Title, Author, Publication_Date, Article_Content]
Index: []
Empty DataFrame
Columns: [docid, News_Outlet, Title, Author, Publication_Date, Article_Content]
Index: []
No repeating strings in the 'Title' column.
No repeating strings in the 'Article_Content' column.
Empty DataFrame
Columns: [docid, News_Outlet, Title, Author, Publication_Date, Article_Content]
Index: []
Empty DataFrame
Columns: [docid, News_Outlet, Title, Author, Publication_Date, Article_Content]
Index: []
No repeating strings in the 'Title' column.
No repeating strings in the 'Article_Content' column.
Empty DataFrame
Columns: [docid, News_Outlet, Title, Author, Publicati

In [55]:
import string
from nltk.corpus import stopwords

def word_stats(data, stopwords=None, n=20):
    from nltk import word_tokenize
    from collections import Counter
    
    tokens_data = word_tokenize(data.strip().lower())
    print('Total words in the text:', len(tokens_data))
    
    stops = stopwords.words('english')
    for punct in string.punctuation:
        stops.append(punct)
    
    not_needed = ["``", "''", "’", "n't", "'s", 
                  "could", "would", "should", 
                  "always", "never", "really", 
                  '“','”', 'said', '–', 'abortion',
                  'abortions', 'cnn', 'fox', 'states',
                  'people','—', 'women', 'state', 
                  'court', 'v.', 'also', 'u.s.', 'tuesday',
                  'voters', 'supreme', 'mr.', 'weeks',
                  'year', 'last', 'one', 'new', 'us',
                  'roe','wade']
    stops.extend(not_needed)
    
    word_counter = Counter(tokens_data)
    
    if stopwords is not None:
        for stop in stops:
            del word_counter[stop]
    
    print("\nTop 20 words by frequency:")
    for word, frequency in word_counter.most_common(n):
        print(f"{word}\t {frequency}")
    
    return None

# Join every article body into one string per outlet
cnn_text = " ".join(cnn_data_frame['Article_Content'].tolist())
fox_text = " ".join(fox_data_frame['Article_Content'].tolist())
nyt_text = " ".join(nyt_data_frame['Article_Content'].tolist())
wsj_text = " ".join(wsj_data_frame['Article_Content'].tolist())
nyp_text = " ".join(nyp_data_frame['Article_Content'].tolist())

# Apply word_stats function to CNN
print("\nWord Stats for CNN:")
word_stats(cnn_text, stopwords, n=10)

# Apply word_stats function to FOX
print("\nWord Stats for FOX:")
word_stats(fox_text, stopwords, n=10)

# Apply word_stats function to NYT
print("\nWord Stats for NYT:")
word_stats(nyt_text, stopwords, n=10)

# Apply word_stats function to WSJ
print("\nWord Stats for WSJ:")
word_stats(wsj_text, stopwords, n=10)

# Apply word_stats function to NYP
print("\nWord Stats for NYP:")
word_stats(nyp_text, stopwords, n=10)


Word Stats for CNN:
Total words in the text: 69348

Top 20 words by frequency:
health	 192
pregnancy	 181
law	 179
care	 175
legal	 165
access	 137
medication	 129
ban	 108
laws	 100
get	 95

Word Stats for FOX:
Total words in the text: 35741

Top 20 words by frequency:
ohio	 123
president	 95
issue	 91
news	 89
law	 76
trump	 68
ban	 65
amendment	 65
biden	 63
pregnancy	 60

Word Stats for NYT:
Total words in the text: 53749

Top 20 words by frequency:
ban	 226
rights	 160
law	 148
right	 128
republicans	 123
republican	 123
ohio	 103
pregnancy	 90
desantis	 88
democrats	 87

Word Stats for WSJ:
Total words in the text: 10332

Top 20 words by frequency:
republican	 36
access	 35
ballot	 31
rights	 30
issue	 29
democrats	 28
abortion-rights	 28
amendment	 23
tuberville	 21
kansas	 21

Word Stats for NYP:
Total words in the text: 4003

Top 20 words by frequency:
texas	 31
pregnancy	 20
judge	 20
mifepristone	 19
ban	 17
cox	 16
drug	 12
law	 12
ruling	 12
walgreens	 11
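
The corpora differ a lot in size (69,348 tokens for CNN versus 4,003 for NYP), so raw counts are hard to compare across outlets. A possible next step, sketched below, is to convert counts to a rate per 1,000 tokens. The numbers are copied from the output above; the helper `per_thousand` is a hypothetical name, and the totals include stopwords and punctuation because `word_stats` counts tokens before filtering.

```python
# Token totals and raw counts copied from the word_stats output above
total_tokens = {'CNN': 69348, 'FOX': 35741, 'NYT': 53749, 'NYP': 4003}
raw_counts = {
    'law': {'CNN': 179, 'FOX': 76, 'NYT': 148, 'NYP': 12},
    'ban': {'CNN': 108, 'FOX': 65, 'NYT': 226, 'NYP': 17},
}

def per_thousand(count, total):
    """Occurrences per 1,000 tokens."""
    return 1000 * count / total

# Build {word: {outlet: rate}} from the raw counts
rates = {
    word: {
        outlet: per_thousand(count, total_tokens[outlet])
        for outlet, count in by_outlet.items()
    }
    for word, by_outlet in raw_counts.items()
}

for word, by_outlet in rates.items():
    for outlet, rate in by_outlet.items():
        print(f"{word}\t{outlet}\t{rate:.2f}")
```

On these figures NYP has the smallest raw count of "law" but the highest rate per 1,000 tokens, which is the kind of difference that raw counts alone would hide.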


In [112]:
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter

def word_stats(data, stopwords=None, n=20):
    # Tokenize the text
    tokens_data = word_tokenize(data.strip().lower())
    print('Total words in the text:', len(tokens_data))

    # Set up stopwords
    stops = stopwords.words('english')
    for punct in string.punctuation:
        stops.append(punct)
    
    # Additional custom stopwords
    not_needed = ["``", "''", "’", "n't", "'s", 
                  "could", "would", "should", 
                  "always", "never", "really", 
                  '“','”', 'said', '–', 'abortion',
                  'abortions', 'cnn', 'fox', 'states',
                  'people','—', 'state', 'court', 'v.', 
                  'also', 'u.s.', 'tuesday', 'mr',
                  'voters', 'supreme', 'mr.', 'weeks',
                  'year', 'last', 'one', 'new', 'us',
                  'roe','wade', 'week', 'six']
    stops.extend(not_needed)

    # Create CountVectorizer
    vectorizer = CountVectorizer(
        lowercase=True,
        strip_accents='unicode',
        input='content',
        encoding='utf-8',
        stop_words=stops
    )

    # Transform text data to document-term matrix
    X = vectorizer.fit_transform([data])

    # Get feature names from CountVectorizer
    feature_names = vectorizer.get_feature_names_out()

    # Create Counter from feature names
    word_counter = Counter(dict(zip(feature_names, X.toarray()[0])))

    # Print the top 20 words by frequency
    print("\nTop 20 words by frequency:")
    for word, frequency in word_counter.most_common(n):
        print(f"{word}\t {frequency}")

    return None

# Join every article body into one string per outlet (CNN, FOX, NYT, WSJ, NYP)
cnn_text = " ".join(cnn_data_frame['Article_Content'].tolist())
fox_text = " ".join(fox_data_frame['Article_Content'].tolist())
nyt_text = " ".join(nyt_data_frame['Article_Content'].tolist())
wsj_text = " ".join(wsj_data_frame['Article_Content'].tolist())
nyp_text = " ".join(nyp_data_frame['Article_Content'].tolist())



In [106]:
# Apply word_stats function to CNN
print("\nWord Stats for CNN:")
word_stats(cnn_text, stopwords, n=10)



Word Stats for CNN:
Total words in the text: 69348

Top 20 words by frequency:
women	 282
health	 194
pregnancy	 189
law	 181
care	 178
legal	 165
access	 144
medication	 131
life	 123
ban	 110


In [107]:
# Apply word_stats function to FOX
print("\nWord Stats for FOX:")
word_stats(fox_text, stopwords, n=10)


Word Stats for FOX:
Total words in the text: 35741

Top 20 words by frequency:
ohio	 123
women	 116
life	 97
president	 95
issue	 94
news	 89
pro	 87
law	 76
trump	 70
amendment	 65


In [113]:
# Apply word_stats function to NYT
print("\nWord Stats for NYT:")
word_stats(nyt_text, stopwords, n=10)


Word Stats for NYT:
Total words in the text: 53749

Top 20 words by frequency:
ban	 228
rights	 182
women	 153
law	 150
republican	 141
right	 132
republicans	 123
life	 110
ohio	 103
pregnancy	 90


In [109]:
# Apply word_stats function to WSJ
print("\nWord Stats for WSJ:")
word_stats(wsj_text, stopwords, n=10)


Word Stats for WSJ:
Total words in the text: 10332

Top 20 words by frequency:
rights	 60
republican	 39
access	 35
ballot	 33
issue	 29
democrats	 28
amendment	 23
kansas	 21
tuberville	 21
women	 21


In [104]:
# Apply word_stats function to NYP
print("\nWord Stats for NYP:")
word_stats(nyp_text, stopwords, n=10)


Word Stats for NYP:
Total words in the text: 4003

Top 20 words by frequency:
texas	 31
judge	 20
mifepristone	 20
pregnancy	 20
ban	 17
cox	 16
drug	 12
fda	 12
law	 12
ruling	 12
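
The outlets differ a lot in size (from 4,003 words for NYP to 69,348 for CNN, according to the totals printed above), so raw counts are hard to compare across outlets. One possible way to put them on a common scale is a rate per 10,000 words. The sketch below is only an illustration: `per_10k` is a hypothetical helper that is not part of the analysis above, and it assumes the printed totals are a reasonable denominator.

```python
def per_10k(count, total_words):
    """Convert a raw count into a rate per 10,000 words."""
    return 10_000 * count / total_words

# (count of "women", total words) copied from the outputs above
women_counts = {"CNN": (282, 69348), "FOX": (116, 35741), "WSJ": (21, 10332)}

for outlet, (count, total) in women_counts.items():
    print(f"{outlet}\t{per_10k(count, total):.1f}")
```

Rates like these make it easier to see whether a word is emphasized more by one outlet or simply appears more often because that outlet's corpus is larger.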


In [94]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import fightinwords as fw
import string
from nltk.corpus import stopwords

#Function to display fighting words output
def display_fw(data, n=10, name1='corpus one', name2='corpus two'):
    print("Top terms in", name2)
    for term, score in data[:n]:
        print(f"{term:<10} {score:6.3f}")
    print("")
    print("Top terms in", name1)
    for term, score in reversed(data[-n:]):
        print(f"{term:<10} {score:6.3f}")

#Function to compare fighting words between outlets
def compare_outlets(outlet1, outlet2, data_frame1, data_frame2, stops=None):
    # Stack the rows for the two outlets into a single dataframe
    merged_data_frame = pd.concat(
        [data_frame1[data_frame1['News_Outlet'] == outlet1],
         data_frame2[data_frame2['News_Outlet'] == outlet2]],
        ignore_index=True)

    # Set the index to 'docid'
    new_outlet_merge = merged_data_frame.set_index('docid')

    # Build the stopword list: NLTK English stopwords, punctuation, and any custom terms
    all_stops = stopwords.words('english')
    all_stops.extend(string.punctuation)
    if stops:
        all_stops.extend(stops)

    # Use CountVectorizer to convert text data into a numeric matrix
    vectorizer = CountVectorizer(
        lowercase=True,
        strip_accents='unicode',
        input='content',
        encoding='utf-8',
        stop_words=all_stops
    )
    X = vectorizer.fit_transform(new_outlet_merge['Article_Content']).toarray()

    # Labels are in the 'News_Outlet' column
    y = merged_data_frame['News_Outlet']

    # Calculate fighting words
    fw_output = fw.bayes_compare_language(
        l1=np.where(y == outlet1)[0],
        l2=np.where(y == outlet2)[0],
        features=X,
        vocab=vectorizer.get_feature_names_out(),
    )

    # Display fighting words output
    print(f"Fighting Words Analysis between {outlet1} and {outlet2}")
    display_fw(fw_output, n=10, name1=outlet1, name2=outlet2)
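
To make the scores in the comparisons easier to read, here is a minimal sketch of the general technique behind "fighting words" comparisons: a z-scored log-odds ratio with a Dirichlet prior. It is an illustration under stated assumptions (a uniform prior of 0.01 per term, toy counts, and the hypothetical function name `log_odds_z`), and it is not necessarily how the `fightinwords` module computes its output.

```python
import math

def log_odds_z(counts1, counts2, prior=0.01):
    """Z-scored log-odds ratio per term, with a uniform Dirichlet prior (assumed)."""
    a0 = prior * len(counts1)              # total prior mass over the vocabulary
    n1, n2 = sum(counts1), sum(counts2)    # corpus sizes
    scores = []
    for y1, y2 in zip(counts1, counts2):
        # Smoothed log-odds of the term within each corpus
        lo1 = math.log((y1 + prior) / (n1 + a0 - y1 - prior))
        lo2 = math.log((y2 + prior) / (n2 + a0 - y2 - prior))
        # Approximate variance of the difference in log-odds
        var = 1.0 / (y1 + prior) + 1.0 / (y2 + prior)
        scores.append((lo1 - lo2) / math.sqrt(var))
    return scores

# Toy counts, invented for illustration only
vocab = ["health", "ohio", "law"]
z = log_odds_z([30, 2, 10], [5, 25, 10])
for term, score in sorted(zip(vocab, z), key=lambda pair: pair[1]):
    print(f"{term:<10} {score:6.3f}")
```

In this sketch, a positive score means the term is relatively more frequent in the first corpus and a negative score means it is relatively more frequent in the second, which matches how `display_fw` lists the most negative terms under `name2` and the most positive under `name1`.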



In [95]:
stops_cnn_fox = ['news', 'cnn', 'could', 'states', 'us', 'like', 'get',
                 'likely', 'former', 'digital', '2023', 'fox']

print('Comparison #1:')
compare_outlets('CNN', 'FOX', cnn_data_frame, fox_data_frame, 
                stops=stops_cnn_fox)



Comparison #1:
Vocab size is 7571
Comparing language...
Fighting Words Analysis between CNN and FOX
Top terms in FOX
ohio       -8.623
president  -8.481
pro        -7.948
issue      -7.733
trump      -7.171
amendment  -7.141
biden      -6.967
catholic   -5.941
gov        -5.867
democrats  -5.608

Top terms in CNN
medication  5.996
care        5.293
legal       4.594
procedure   4.492
health      4.083
risk        3.904
data        3.870
pregnancy   3.828
cases       3.616
banned      3.613


In [69]:
stops_cnn_wsj = ['news', 'could', 'states', 'us', 'like', 'get',
                 'likely', 'former', 'digital', '2023', 'tuesday',
                 'state', 'results', 'voters', 'rights']

print('Comparison #2:')
compare_outlets('CNN', 'WSJ', cnn_data_frame, wsj_data_frame, 
                stops=stops_cnn_wsj)




Comparison #2:
Vocab size is 6684
Comparing language...
Fighting Words Analysis between CNN and WSJ
Top terms in WSJ
republican -7.897
democrats  -7.695
amendment  -6.994
kansas     -6.368
issue      -6.263
ohio       -6.013
senate     -5.991
michigan   -5.838
election   -5.821
constitution -5.650

Top terms in CNN
health      3.959
care        3.744
women       3.485
pregnancy   3.327
people      3.212
medication  2.690
woman       2.663
texas       2.647
medical     2.620
pregnant    2.611


In [76]:
stops_cnn_nyp = ['news', 'could', 'states', 'us', 'like', 'get',
                 'likely', 'former', 'digital', '2023', 'tuesday',
                 'state', 'results', 'voters', 'rights', '20', 
                 'general', 'attorneys', 'said', 'cnn', 'also', 'one',
                 'may', 'says']


print('Comparison #3:')
compare_outlets('CNN', 'NYP', cnn_data_frame, nyp_data_frame, 
                stops=stops_cnn_nyp)
print('\n')



Comparison #3:
Vocab size is 6310
Comparing language...
Fighting Words Analysis between CNN and NYP
Top terms in NYP
judge      -8.019
texas      -8.001
mifepristone -6.862
biden      -5.528
lawsuit    -5.528
filed      -5.396
ruled      -5.048
fda        -4.867
attorney   -4.349
complications -4.349

Top terms in CNN
access      2.208
women       2.009
health      2.006
care        1.893
laws        1.853
legal       1.705
life        1.427
right       1.335
anti        1.319
take        1.282




In [84]:
stops_nyt_fox = ['fox', 'former', '2023', 'state', 'mr', 
                 'year', 'news', 'six', 'bans', 'could', 'would']

print('Comparison #4:')
compare_outlets('NYT', 'FOX', nyt_data_frame, fox_data_frame, 
                stops=stops_nyt_fox)
print('\n')

Comparison #4:
Vocab size is 6651
Comparing language...
Fighting Words Analysis between NYT and FOX
Top terms in FOX
pro        -6.657
catholic   -5.410
mifepristone -5.249
trump      -5.249
president  -5.215
alito      -4.358
issue      -4.243
amendment  -4.189
ohio       -4.140
haley      -4.051

Top terms in NYT
ban         6.312
rights      4.895
procedure   4.504
carolina    4.416
right       4.376
republicans  4.335
legislature  4.210
bill        4.163
south       4.105
dr          3.784




In [87]:
stops_nyt_wsj = ['2024', 'mr', 'six', 'week', 'weeks', 'tuesday']

print('Comparison #5:')
compare_outlets('NYT', 'WSJ', nyt_data_frame, wsj_data_frame, 
                stops=stops_nyt_wsj)
print('\n')

Comparison #5:
Vocab size is 5571
Comparing language...
Fighting Words Analysis between NYT and WSJ
Top terms in WSJ
voters     -5.960
kansas     -5.759
tuberville -5.674
results    -5.571
michigan   -5.313
races      -5.170
referendum -4.999
military   -4.918
gop        -4.163
promotions -4.117

Top terms in NYT
ban         3.946
desantis    3.071
bill        2.705
justice     2.697
health      2.623
carolina    2.562
law         2.519
south       2.368
care        2.360
anti        2.354




In [90]:
stops_nyt_nyp = ['2024', '20', 'said', 'would', 'ms', 'one', 'also']

print('Comparison #6:')
compare_outlets('NYT', 'NYP', nyt_data_frame, nyp_data_frame, 
                stops=stops_nyt_nyp)
print('\n')

Comparison #6:
Vocab size is 5288
Comparing language...
Fighting Words Analysis between NYT and NYP
Top terms in NYP
texas      -9.255
mifepristone -8.070
cox        -6.982
judge      -5.814
drug       -4.906
lawsuit    -4.434
administration -4.374
charges    -4.238
complications -4.238
pregnancy  -4.085

Top terms in NYT
rights      2.754
republicans  2.301
right       2.191
voters      2.188
ohio        2.121
desantis    1.786
ballot      1.703
access      1.615
life        1.587
constitution  1.559




In [93]:
stops_nyt_cnn = ['2024', 'us', 'people', 'may', 'get', 'ms', 'according',
                 'go']

print('Comparison #7:')
compare_outlets('NYT', 'CNN', nyt_data_frame, cnn_data_frame, 
                stops=stops_nyt_cnn)
print('\n')

Comparison #7:
Vocab size is 8205
Comparing language...
Fighting Words Analysis between NYT and CNN
Top terms in CNN
care       -6.430
medication -5.942
health     -4.754
research   -4.634
baby       -4.602
data       -4.582
pregnancy  -4.336
women      -4.257
risk       -4.096
maternal   -3.991

Top terms in NYT
republicans  8.066
republican  8.016
ban         7.900
voters      7.827
rights      7.221
ohio        7.216
democrats   7.019
state       7.019
constitution  6.451
gov         6.022




## Data Scraping Presidential Elections

Let's scrape the portions of presidential and vice-presidential debate transcripts that mention abortion. 


In [96]:
##install requests

! pip install requests


In [97]:
! pip install nltk



In [7]:
#Scraped Data for Vice Presidential Debate 2020 (Kamala Harris and Mike Pence)

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = "https://debates.org/voter-education/debate-transcripts/vice-presidential-debate-at-the-university-of-utah-in-salt-lake-city-utah/" 
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the indices of paragraphs containing the word "abortion"
abortion_indices = [index for index, p in enumerate(all_paragraphs) if 'abortion' in p.get_text().lower()]

if abortion_indices:
    # Extract paragraphs between the first and last mention of "abortion"
    start_index = abortion_indices[0]
    end_index = abortion_indices[-1]
    
    paragraphs_between_abortion = all_paragraphs[start_index:end_index + 1]

    # Collect rows for a DataFrame with columns 'name', 'year', 'dem/rep', 'left/right', and 'content'
    data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}
    count_pence = 0

    for paragraph in paragraphs_between_abortion:
        count_pence += paragraph.get_text().count('PENCE:')
        
        if count_pence == 4:
            # Modify the paragraph in-place if it's the fourth instance of 'PENCE:'
            paragraph.string = re.sub(r'^PENCE:', 'HARRIS:', paragraph.get_text())
    
        # Skip paragraphs that start with "PAGE:"
        if not paragraph.get_text().startswith("PAGE:"):
            # Determine 'name' and 'dem/rep' based on the speaker
            if 'PENCE:' in paragraph.get_text():
                data['name'].append('PENCE')
                data['year'].append(2020)
                data['dem/rep'].append('rep')
                data['left/right'].append('right')
            else:
                data['name'].append('HARRIS')
                data['year'].append(2020)
                data['dem/rep'].append('dem')
                data['left/right'].append('left')
            
            data['content'].append(re.sub(r'^(?:PENCE|HARRIS):', '', paragraph.get_text()))

    prez_df_2020 = pd.DataFrame(data)

    # Print the DataFrame
    print(prez_df_2020)
    print(prez_df_2020.shape)
        

     name  year dem/rep left/right  \
0   PENCE  2020     rep      right   
1   PENCE  2020     rep      right   
2   PENCE  2020     rep      right   
3  HARRIS  2020     dem       left   
4   PENCE  2020     rep      right   

                                             content  
0   Well thank you for the question, but I’ll use...  
1   My hope is that when the hearing takes place,...  
2   – treated respectfully and voted and confirme...  
3   Thank you, Susan. First of all, Joe Biden and...  
4   Well, thank you, Susan. Let me just say, addr...  
(5, 5)
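
The selection rule used in these scraping cells (keep everything from the first paragraph that mentions a keyword through the last one that does) can be sketched on plain strings, without any network access. The function name `span_between_mentions` and the toy transcript are hypothetical and exist only for this illustration.

```python
def span_between_mentions(paragraphs, keyword):
    """Return the paragraphs from the first to the last one containing keyword."""
    hits = [i for i, text in enumerate(paragraphs) if keyword in text.lower()]
    if not hits:
        return []
    return paragraphs[hits[0]:hits[-1] + 1]

# Toy transcript, invented for illustration only
toy = [
    "MODERATOR: Welcome.",
    "A: My view on abortion is ...",
    "B: I disagree.",
    "A: On abortion again ...",
    "MODERATOR: Next topic.",
]
selected = span_between_mentions(toy, "abortion")
print(len(selected))
```

Note that the span keeps paragraphs in the middle that never use the keyword, which is useful for capturing replies but can also pull in unrelated material if the keyword recurs much later in a transcript.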


In [8]:
#Scraped Data for Vice Presidential Debate 2016 (Mike Pence and Tim Kaine)

#October 4, 2016

URL = "https://www.debates.org/voter-education/debate-transcripts/october-4-2016-debate-transcript/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the index of the paragraph with the last mention of "abortion"
last_abortion_index = max([index for index, p in enumerate(all_paragraphs) if 'abortion' in p.get_text().lower()], default=0)

# Find the index of the paragraph starting with "PENCE: But for me"
pence_index = next((i for i, p in enumerate(all_paragraphs) if p.get_text().startswith("PENCE: But for me")), None)

# Extract paragraphs between "PENCE: But for me" and the last mention of "abortion"
paragraphs_selected = all_paragraphs[pence_index:last_abortion_index + 1]

# Create a DataFrame with columns for 'name', 'dem/rep', 'left/right', and 'content'
data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}

current_speaker = None

for paragraph in paragraphs_selected:
    text = paragraph.get_text()
    
    # Skip paragraphs that start with "QUIJANO:"
    if text.startswith("QUIJANO:"):
        continue
    
    # Determine 'name', 'dem/rep', and 'left/right' based on the speaker;
    # unlabeled paragraphs keep the values from the previous speaker
    if 'PENCE:' in text:
        current_speaker = 'PENCE'
        dem_rep = 'rep'
        left_right = 'right'
    elif 'KAINE:' in text:
        current_speaker = 'KAINE'
        dem_rep = 'dem'
        left_right = 'left'
    
    # Append data to the dictionary
    data['name'].append(current_speaker)
    data['year'].append(2016)
    data['dem/rep'].append(dem_rep)
    data['left/right'].append(left_right)
    data['content'].append(re.sub(r'^(?:PENCE|KAINE):', '', paragraph.get_text()))
    
# Create a DataFrame from the dictionary
prez_df_2016 = pd.DataFrame(data)

# Print the DataFrame
print(prez_df_2016)
print(prez_df_2016.shape)


     name  year dem/rep left/right  \
0   PENCE  2016     rep      right   
1   PENCE  2016     rep      right   
2   PENCE  2016     rep      right   
3   PENCE  2016     rep      right   
4   PENCE  2016     rep      right   
5   KAINE  2016     dem       left   
6   KAINE  2016     dem       left   
7   KAINE  2016     dem       left   
8   KAINE  2016     dem       left   
9   KAINE  2016     dem       left   
10  KAINE  2016     dem       left   
11  PENCE  2016     rep      right   
12  KAINE  2016     dem       left   
13  PENCE  2016     rep      right   
14  KAINE  2016     dem       left   
15  PENCE  2016     rep      right   
16  KAINE  2016     dem       left   
17  PENCE  2016     rep      right   
18  KAINE  2016     dem       left   
19  PENCE  2016     rep      right   
20  KAINE  2016     dem       left   
21  PENCE  2016     rep      right   
22  KAINE  2016     dem       left   
23  PENCE  2016     rep      right   
24  KAINE  2016     dem       left   
25  PENCE  2
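
When a speaker label is stripped with a regular expression, the placement of `^` matters. In a pattern of the form `^A:|B:`, the anchor applies only to the first alternative, so `B:` is removed wherever it occurs in the paragraph. The sketch below shows the difference on an invented sentence; the sentence is hypothetical and does not come from any transcript.

```python
import re

text = "PENCE: As I told KAINE: we disagree."

# "^" binds only to the first alternative, so "KAINE:" is removed anywhere
loose = re.sub(r'^PENCE:|KAINE:', '', text)

# Grouping the names anchors both alternatives to the start of the string
strict = re.sub(r'^(?:PENCE|KAINE):', '', text)

print(repr(loose))
print(repr(strict))
```

The grouped form only ever removes a label at the start of a paragraph, which is usually the intent when cleaning transcript text.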

In [9]:
#Scraped Data for Third Presidential Debate 2016 (Hillary Clinton and Donald Trump) 

#October 19, 2016

URL = "https://www.debates.org/voter-education/debate-transcripts/october-19-2016-debate-transcript/" 
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the indices of paragraphs containing the word "abortion"
abortion_indices = [index for index, p in enumerate(all_paragraphs) if 'abortion' in p.get_text().lower()]

if abortion_indices:
    # Extract paragraphs between the first and last mention of "abortion"
    start_index = abortion_indices[0]
    end_index = abortion_indices[-1]
    
    paragraphs_between_abortion = all_paragraphs[start_index:end_index + 1]

    # Create a DataFrame with columns for 'name', 'year', 'dem/rep', 'left/right', 'content'
    data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}

    current_speaker = None

    for paragraph in paragraphs_between_abortion:
        # Skip paragraphs that start with "WALLACE:"
        if not paragraph.get_text().startswith("WALLACE:"):
            # Determine 'name' and 'dem/rep' based on the speaker
            if 'TRUMP:' in paragraph.get_text():
                current_speaker = 'TRUMP'
            elif 'CLINTON:' in paragraph.get_text():
                current_speaker = 'CLINTON'
            
            data['name'].append(current_speaker)
            data['year'].append(2016)
            data['dem/rep'].append('rep' if current_speaker == 'TRUMP' else 'dem')
            data['left/right'].append('right' if current_speaker == 'TRUMP' else 'left')
            data['content'].append(re.sub(r'^(?:TRUMP|CLINTON):', '', paragraph.get_text()))

    prez_df_2016_2 = pd.DataFrame(data)

    # Print the DataFrame
    print(prez_df_2016_2)
    print(prez_df_2016_2.shape)



       name  year dem/rep left/right  \
0     TRUMP  2016     rep      right   
1     TRUMP  2016     rep      right   
2     TRUMP  2016     rep      right   
3     TRUMP  2016     rep      right   
4   CLINTON  2016     dem       left   
5   CLINTON  2016     dem       left   
6   CLINTON  2016     dem       left   
7   CLINTON  2016     dem       left   
8   CLINTON  2016     dem       left   
9   CLINTON  2016     dem       left   
10    TRUMP  2016     rep      right   
11    TRUMP  2016     rep      right   
12  CLINTON  2016     dem       left   
13  CLINTON  2016     dem       left   

                                              content  
0                                              Right.  
1    Well, if that would happen, because I am pro-...  
2    If they overturned it, it will go back to the...  
3    Well, if we put another two or perhaps three ...  
4    Well, I strongly support Roe v. Wade, which g...  
5   So many states are putting very stringent regu...  
6   Don

In [10]:
#Scraped Data for Vice Presidential Debate 2012 (Biden and Ryan)

#October 11, 2012

URL = "https://www.debates.org/voter-education/debate-transcripts/october-11-2012-the-biden-romney-vice-presidential-debate/"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the index of the paragraph containing "RYAN: I don’t see how a"
ryan_paragraph_index = next((index for index, p in enumerate(all_paragraphs) if 'RYAN: I don’t see how a' in p.get_text()), None)

# Find the indices of paragraphs containing the word "abortion" after the specified paragraph
abortion_indices = [index for index, p in enumerate(all_paragraphs[ryan_paragraph_index:]) if 'abortion' in p.get_text().lower()]

if abortion_indices:
    # Adjust the start index to include the paragraph containing "RYAN: I don’t see how a"
    start_index = ryan_paragraph_index
    end_index = ryan_paragraph_index + abortion_indices[-1]
    
    paragraphs_between_abortion = all_paragraphs[start_index:end_index + 1]

    # Create a DataFrame with columns for 'name', 'year', 'dem/rep', 'left/right', 'content'
    data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}

    current_speaker = None

    for paragraph in paragraphs_between_abortion:
        # Skip paragraphs that start with "RADDATZ:"
        if not paragraph.get_text().startswith("RADDATZ:"):
            # Determine 'name' and 'dem/rep' based on the speaker
            if 'RYAN:' in paragraph.get_text():
                current_speaker = 'RYAN'
            elif 'BIDEN:' in paragraph.get_text():
                current_speaker = 'BIDEN'
            
            data['name'].append(current_speaker)
            data['year'].append(2012)
            data['dem/rep'].append('rep' if current_speaker == 'RYAN' else 'dem')
            data['left/right'].append('right' if current_speaker == 'RYAN' else 'left')
            data['content'].append(re.sub(r'^(?:RYAN|BIDEN):', '', paragraph.get_text()))

    prez_df_2012 = pd.DataFrame(data)

    # Print the DataFrame
    print(prez_df_2012)
    print(prez_df_2012.shape)

     name  year dem/rep left/right  \
0    RYAN  2012     rep      right   
1    RYAN  2012     rep      right   
2    RYAN  2012     rep      right   
3    RYAN  2012     rep      right   
4    RYAN  2012     rep      right   
5   BIDEN  2012     dem       left   
6   BIDEN  2012     dem       left   
7   BIDEN  2012     dem       left   
8    RYAN  2012     rep      right   
9    RYAN  2012     rep      right   
10  BIDEN  2012     dem       left   
11  BIDEN  2012     dem       left   
12   RYAN  2012     rep      right   
13   RYAN  2012     rep      right   
14  BIDEN  2012     dem       left   

                                              content  
0    I don’t see how a person can separate their p...  
1    Now, you want to ask basically why I’m pro-li...  
2   You know, I think about 10 1/2 years ago, my w...  
3   That’s why — those are the reasons why I’m pro...  
4   Our church should not have to sue our federal ...  
5    My religion defines who I am, and I’ve been a...  

In [11]:
#Scraped Data for Third Presidential Debate 2008 (McCain and Obama)

#October 15, 2008

URL = "https://www.debates.org/voter-education/debate-transcripts/october-15-2008-debate-transcript/" 
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the indices of paragraphs containing the word "abortion"
abortion_indices = [index for index, p in enumerate(all_paragraphs) if 'abortion' in p.get_text().lower()]

if abortion_indices:
    # Extract paragraphs between the first and last mention of "abortion"
    start_index = abortion_indices[0]
    end_index = abortion_indices[-1]
    
    paragraphs_between_abortion = all_paragraphs[start_index:end_index + 1]

    # Create a DataFrame with columns for 'name', 'year', 'dem/rep', 'left/right', 'content'
    data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}

    current_speaker = None

    for paragraph in paragraphs_between_abortion:
        # Skip paragraphs that start with "SCHIEFFER:"
        if not paragraph.get_text().startswith("SCHIEFFER:"):
            # Determine 'name' and 'dem/rep' based on the speaker
            if 'MCCAIN:' in paragraph.get_text():
                current_speaker = 'MCCAIN'
            elif 'OBAMA:' in paragraph.get_text():
                current_speaker = 'OBAMA'
            
            data['name'].append(current_speaker)
            data['year'].append(2008)
            data['dem/rep'].append('rep' if current_speaker == 'MCCAIN' else 'dem')
            data['left/right'].append('right' if current_speaker == 'MCCAIN' else 'left')
            data['content'].append(re.sub(r'^(?:MCCAIN|OBAMA):', '', paragraph.get_text()))

    prez_df_2008 = pd.DataFrame(data)

    # Print the DataFrame
    print(prez_df_2008)
    print(prez_df_2008.shape)

      name  year dem/rep left/right  \
0   MCCAIN  2008     rep      right   
1    OBAMA  2008     dem       left   
2    OBAMA  2008     dem       left   
3    OBAMA  2008     dem       left   
4    OBAMA  2008     dem       left   
5    OBAMA  2008     dem       left   
6    OBAMA  2008     dem       left   
7    OBAMA  2008     dem       left   
8    OBAMA  2008     dem       left   
9    OBAMA  2008     dem       left   
10  MCCAIN  2008     rep      right   
11  MCCAIN  2008     rep      right   
12  MCCAIN  2008     rep      right   
13  MCCAIN  2008     rep      right   
14  MCCAIN  2008     rep      right   
15  MCCAIN  2008     rep      right   
16  MCCAIN  2008     rep      right   
17   OBAMA  2008     dem       left   
18   OBAMA  2008     dem       left   
19   OBAMA  2008     dem       left   
20   OBAMA  2008     dem       left   
21   OBAMA  2008     dem       left   
22   OBAMA  2008     dem       left   
23   OBAMA  2008     dem       left   
24   OBAMA  2008     dem 

In [12]:
#Scraped Data for Presidential Debate 2004 (Bush and Kerry)

#October 13, 2004

URL = "https://www.debates.org/voter-education/debate-transcripts/october-13-2004-debate-transcript/" 
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the index of the paragraph containing "KERRY: I respect their views."
kerry_paragraph_index = next((index for index, p in enumerate(all_paragraphs) if 'KERRY: I respect their views.' in p.get_text()), None)

# Find the indices of paragraphs containing the word "abortion" after the specified paragraph
abortion_indices = [index for index, p in enumerate(all_paragraphs[kerry_paragraph_index:]) if 'abortion' in p.get_text().lower()]

if abortion_indices:
    # Adjust the start index to include the paragraph containing "KERRY: I respect their views."
    start_index = kerry_paragraph_index
    end_index = kerry_paragraph_index + abortion_indices[-1]
    
    paragraphs_between_abortion = all_paragraphs[start_index:end_index + 1]

    # Create a DataFrame with columns for 'name', 'year', 'dem/rep', 'left/right', 'content'
    data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}

    current_speaker = None

    for paragraph in paragraphs_between_abortion:
        # Skip paragraphs that start with "SCHIEFFER"
        if not paragraph.get_text().startswith("SCHIEFFER"):
            # Determine 'name' and 'dem/rep' based on the speaker
            if 'BUSH:' in paragraph.get_text():
                current_speaker = 'BUSH'
            elif 'KERRY:' in paragraph.get_text():
                current_speaker = 'KERRY'
            
            data['name'].append(current_speaker)
            data['year'].append(2004)
            data['dem/rep'].append('rep' if current_speaker == 'BUSH' else 'dem')
            data['left/right'].append('right' if current_speaker == 'BUSH' else 'left')
            data['content'].append(re.sub(r'^(?:BUSH|KERRY):', '', paragraph.get_text()))

    prez_df_2004 = pd.DataFrame(data)

    # Print the DataFrame
    print(prez_df_2004)
    print(prez_df_2004.shape)

     name  year dem/rep left/right  \
0   KERRY  2004     dem       left   
1   KERRY  2004     dem       left   
2   KERRY  2004     dem       left   
3   KERRY  2004     dem       left   
4   KERRY  2004     dem       left   
5   KERRY  2004     dem       left   
6   KERRY  2004     dem       left   
7   KERRY  2004     dem       left   
8   KERRY  2004     dem       left   
9   KERRY  2004     dem       left   
10  KERRY  2004     dem       left   
11  KERRY  2004     dem       left   
12  KERRY  2004     dem       left   
13   BUSH  2004     rep      right   
14   BUSH  2004     rep      right   
15   BUSH  2004     rep      right   
16   BUSH  2004     rep      right   

                                              content  
0    I respect their views. I completely respect t...  
1   I believe that I can’t legislate or transfer t...  
2   I believe that choice is a woman’s choice. It’...  
3   Now, I will not allow somebody to come in and ...  
4   The president has never said wh

In [13]:
#Scraped Data for Presidential Debate 2004 (Bush and Kerry)

#October 8, 2004

URL = "https://www.debates.org/voter-education/debate-transcripts/october-8-2004-debate-transcript/" 
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the index of the paragraph containing "DEGENHART: Senator Kerry, suppose"
starting_index = next((index for index, p in enumerate(all_paragraphs) if 'DEGENHART: Senator Kerry, suppose' in p.get_text()), None)

# Find the indices of paragraphs containing the word "abortion" after the specified paragraph
abortion_indices = [index for index, p in enumerate(all_paragraphs[starting_index:]) if 'abortion' in p.get_text().lower()]

if abortion_indices:
    end_index = starting_index + abortion_indices[-1]
    
    paragraphs_between_abortion = all_paragraphs[starting_index:end_index + 1]

    # Create a DataFrame with columns for 'name', 'year', 'dem/rep', 'left/right', 'content'
    data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}

    current_speaker = None

    for paragraph in paragraphs_between_abortion:
        # Skip paragraphs that start with "DEGENHART:" or "GIBSON:"
        if not paragraph.get_text().startswith("DEGENHART:") and not paragraph.get_text().startswith("GIBSON:"):
            # Determine 'name' and 'dem/rep' based on the speaker
            if 'BUSH:' in paragraph.get_text():
                current_speaker = 'BUSH'
            elif 'KERRY:' in paragraph.get_text():
                current_speaker = 'KERRY'
            
            data['name'].append(current_speaker)
            data['year'].append(2004)
            data['dem/rep'].append('rep' if current_speaker == 'BUSH' else 'dem')
            data['left/right'].append('right' if current_speaker == 'BUSH' else 'left')
            data['content'].append(re.sub(r'^(?:BUSH|KERRY):', '', paragraph.get_text()))

    prez_df_2004_2 = pd.DataFrame(data)

    # Print the DataFrame
    print(prez_df_2004_2)
    print(prez_df_2004_2.shape)

     name  year dem/rep left/right  \
0   KERRY  2004     dem       left   
1   KERRY  2004     dem       left   
2   KERRY  2004     dem       left   
3   KERRY  2004     dem       left   
4   KERRY  2004     dem       left   
5   KERRY  2004     dem       left   
6   KERRY  2004     dem       left   
7   KERRY  2004     dem       left   
8   KERRY  2004     dem       left   
9   KERRY  2004     dem       left   
10   BUSH  2004     rep      right   
11   BUSH  2004     rep      right   
12   BUSH  2004     rep      right   
13   BUSH  2004     rep      right   
14   BUSH  2004     rep      right   
15   BUSH  2004     rep      right   
16   BUSH  2004     rep      right   
17   BUSH  2004     rep      right   
18   BUSH  2004     rep      right   
19   BUSH  2004     rep      right   
20   BUSH  2004     rep      right   
21   BUSH  2004     rep      right   
22  KERRY  2004     dem       left   
23  KERRY  2004     dem       left   
24  KERRY  2004     dem       left   
25   BUSH  2

In [14]:
#Scraped Data for Presidential Debate 2000 (Gore and Bush)

#October 3, 2000

URL = "https://www.debates.org/voter-education/debate-transcripts/october-3-2000-transcript/" 
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
all_paragraphs = soup.find_all('p')

# Find the indices of paragraphs containing the word "abortion"
abortion_indices = [index for index, p in enumerate(all_paragraphs) if 'abortion' in p.get_text().lower()]

if abortion_indices:
    # Extract paragraphs between the first and last mention of "abortion"
    start_index = abortion_indices[0]
    end_index = abortion_indices[-1]
    
    paragraphs_between_abortion = all_paragraphs[start_index:end_index + 1]

    # Create a DataFrame with columns for 'name', 'year', 'dem/rep', 'left/right', 'content'
    data = {'name': [], 'year': [], 'dem/rep': [], 'left/right': [], 'content': []}

    current_speaker = None

    for paragraph in paragraphs_between_abortion:
        # Skip paragraphs that start with "MODERATOR:"
        if not paragraph.get_text().startswith("MODERATOR:"):
            # Determine 'name' and 'dem/rep' based on the speaker
            if 'BUSH:' in paragraph.get_text():
                current_speaker = 'BUSH'
            elif 'GORE:' in paragraph.get_text():
                current_speaker = 'GORE'
            
            data['name'].append(current_speaker)
            data['year'].append(2000)
            data['dem/rep'].append('rep' if current_speaker == 'BUSH' else 'dem')
            data['left/right'].append('right' if current_speaker == 'BUSH' else 'left')
            data['content'].append(re.sub(r'^(?:BUSH|GORE):', '', paragraph.get_text()))

    prez_df_2000 = pd.DataFrame(data)

    # Print the DataFrame
    print(prez_df_2000)
    print(prez_df_2000.shape)

   name  year dem/rep left/right  \
0  BUSH  2000     rep      right   
1  GORE  2000     dem       left   

                                             content  
0   I don’t think a president can do that. I was ...  
1   Well, Jim, the FDA took 12 years, and I do su...  
(2, 5)
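
The scraping cells above all repeat one pattern: skip moderator paragraphs, detect the speaker from a label, and let unlabeled paragraphs inherit the previous speaker. A minimal sketch of that pattern as a single reusable function follows. The name `label_speakers`, its parameters, and the toy transcript are hypothetical; unlike the cells above, it matches labels only at the start of a paragraph and returns a list of row dictionaries (which could be passed to `pd.DataFrame`) rather than a dataframe.

```python
def label_speakers(paragraphs, speakers, year, skip=()):
    """Build one row per paragraph; unlabeled paragraphs inherit the last speaker.

    speakers maps a transcript label such as "BUSH" to its party ("rep" or "dem").
    skip lists labels (for example moderators) whose paragraphs are dropped.
    """
    rows, current = [], None
    for text in paragraphs:
        if any(text.startswith(f"{name}:") for name in skip):
            continue
        for name in speakers:
            if text.startswith(f"{name}:"):
                current = name
                text = text[len(name) + 1:]   # drop the "NAME:" label
                break
        if current is None:
            continue                          # no speaker identified yet
        party = speakers[current]
        rows.append({
            "name": current,
            "year": year,
            "dem/rep": party,
            "left/right": "right" if party == "rep" else "left",
            "content": text,
        })
    return rows

# Toy transcript, invented for illustration only
toy = ["BUSH: First point.", "Second point.", "MODERATOR: Thank you.", "GORE: A reply."]
rows = label_speakers(toy, {"BUSH": "rep", "GORE": "dem"}, year=2000, skip=("MODERATOR",))
for row in rows:
    print(row["name"], row["dem/rep"], row["content"])
```

Keeping the logic in one place means a fix to the labeling rules applies to every debate at once, instead of being repeated in each cell.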


#### Merge Dataframes 

Concatenate the per-debate dataframes into a single dataframe covering all the debates. The result should have 140 rows and 5 columns.

In [15]:
prez_merged = pd.concat([prez_df_2020, prez_df_2016, prez_df_2016_2, prez_df_2012, prez_df_2008, prez_df_2004_2, prez_df_2004, prez_df_2000], ignore_index=True)
print(prez_merged)
print(prez_merged.shape)

       name  year dem/rep left/right  \
0     PENCE  2020     rep      right   
1     PENCE  2020     rep      right   
2     PENCE  2020     rep      right   
3    HARRIS  2020     dem       left   
4     PENCE  2020     rep      right   
..      ...   ...     ...        ...   
135    BUSH  2004     rep      right   
136    BUSH  2004     rep      right   
137    BUSH  2004     rep      right   
138    BUSH  2000     rep      right   
139    GORE  2000     dem       left   

                                               content  
0     Well thank you for the question, but I’ll use...  
1     My hope is that when the hearing takes place,...  
2     – treated respectfully and voted and confirme...  
3     Thank you, Susan. First of all, Joe Biden and...  
4     Well, thank you, Susan. Let me just say, addr...  
..                                                 ...  
135  Take, for example, the ban on partial birth ab...  
136  What I’m saying is, is that as we promote life...  
137  T

#### Analyze Top Words Used in Presidential Debate (Left)

In [44]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# Filter rows where 'dem/rep' is 'dem'
dem_df = prez_merged[prez_merged['dem/rep'] == 'dem']

# Combine all content into a single string
left_content = ' '.join(dem_df['content'])

#custom stopwords
custom_stopwords = ['president','think','people','wade','said','make','would','know','abortion']

# Create a CountVectorizer
vectorizer = CountVectorizer(stop_words=stopwords.words('english')+custom_stopwords)

# Fit and transform the content
word_matrix = vectorizer.fit_transform([left_content])

# Get feature names (words)
left_words = vectorizer.get_feature_names_out()

# Get word counts
word_counts = word_matrix.toarray()[0]

# Create a DataFrame with words and counts
word_df = pd.DataFrame({'word': left_words, 'count': word_counts})

# Sort DataFrame by count in descending order
word_df = word_df.sort_values(by='count', ascending=False)

# Print the top words
print(word_df.head(10))

         word  count
693     women     22
537     right     16
360      life     15
233     faith     15
543       roe     15
323     issue     12
158     court     12
692     woman     12
109  catholic     12
116    choice     11


#### Analyze Top Words Used in Presidential Debate (Right)

In [43]:
# Filter rows where 'dem/rep' is 'rep'
rep_df = prez_merged[prez_merged['dem/rep'] == 'rep']

# Combine all content into a single string
right_content = ' '.join(rep_df['content'])

#custom stopwords
custom_stopwords = ['abortion','know','think','would','people','senator','abortions']

# Create a CountVectorizer
vectorizer = CountVectorizer(stop_words=stopwords.words('english')+custom_stopwords)

# Fit and transform the content
word_matrix = vectorizer.fit_transform([right_content])

# Get feature names (words)
right_words = vectorizer.get_feature_names_out()

# Get word counts
word_counts = word_matrix.toarray()[0]

# Create a DataFrame with words and counts
word_df = pd.DataFrame({'word': right_words , 'count': word_counts})

# Sort DataFrame by count in descending order
word_df = word_df.sort_values(by='count', ascending=False)

# Print the top words
print(word_df.head(10))

        word  count
307     life     45
31   america     16
417      pro     15
133    court     12
581    voted     10
77     birth     10
422  promote     10
520  supreme      9
99     child      9
508   states      9


In [51]:
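import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Fetch the VADER lexicon if it is not already installed (no-op when present)
nltk.download('vader_lexicon', quiet=True)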

def overall_sentiment(text,analyzer=SentimentIntensityAnalyzer()):  
    
    sentiment_dict = analyzer.polarity_scores(text)
    comp = sentiment_dict['compound']
    
    #overall sentiment
    if comp >= 0.05: 
        sentiment = 'positive'
    elif comp <= -0.05: 
        sentiment = 'negative'
    else: 
        sentiment = 'neutral'
    sentiment_dict['Overall sentiment'] = sentiment 
    return sentiment_dict 

print("Overall Sentiment for Left:")
print(overall_sentiment(left_content))

print("\nOverall Sentiment for Right:")
print(overall_sentiment(right_content))

Overall Sentiment for Left:
{'neg': 0.077, 'neu': 0.782, 'pos': 0.141, 'compound': 0.9997, 'Overall sentiment': 'positive'}

Overall Sentiment for Right:
{'neg': 0.066, 'neu': 0.743, 'pos': 0.192, 'compound': 0.9999, 'Overall sentiment': 'positive'}
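
One caveat when reading the compound values above: VADER sums the valence of every scored word and then squashes that sum into the range -1 to 1 with the normalization `x / sqrt(x^2 + alpha)`, where `alpha` defaults to 15. For a long concatenated text the raw sum is large, so the compound score is pushed toward +1 or -1 almost regardless of content. The short sketch below reproduces only that normalization step (the function name `vader_normalize` is ours, and the raw sums are illustrative numbers, not values from the debates).

```python
import math

def vader_normalize(raw_sum, alpha=15.0):
    """Squash an unbounded summed valence into (-1, 1), as VADER does for 'compound'."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# Illustrative raw sums: a short sentence versus increasingly long texts
for raw_sum in (1, 5, 20, 100, 500):
    print(raw_sum, round(vader_normalize(raw_sum), 4))
```

Scoring each response separately, for example with `prez_merged['content'].apply(lambda t: overall_sentiment(t)['compound'])`, and then comparing the distributions by party may give a more informative picture than a single score per side.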


#### Analysis of **VADER**

Both the left and right show highly positive VADER sentiment analysis scores. This makes sense considering the context of 

### Data Cleaning - Congressional Bills 

Sources:

- https://data.stanford.edu/congress_text
- https://www.congress.gov/search?q=%7B%22congress%22%3A%5B%22118%22%5D%2C%22source%22%3A%5B%22legislation%22%2C%22committee-meetings%22%2C%22senate-communications%22%2C%22house-communications%22%5D%2C%22search%22%3A%22abortion%22%2C%22party%22%3A%22Republican%22%2C%22type%22%3A%22bills%22%7D

Filtering: legislative bills, Republican bills (search term "abortion", 118th Congress).

In [77]:
def scrape_and_combine_html(urls):
    # Initialize an empty string to store the combined HTML content
    combined_html_content = ''

    # Iterate through each URL in the list
    for url in urls:
        # Fetch the content from the current URL, failing fast on network or HTTP errors
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        html_content = response.content

        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')

        # Extract text content from the HTML
        text_content = soup.get_text(separator=' ', strip=True)

        # Append the extracted text content to the result
        combined_html_content += text_content + ' '

    return combined_html_content
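
Because this function concatenates every URL it is given, a repeated URL is counted twice in the word counts and sentiment scores. The URL lists in this notebook each contain one repeated entry (`hr3741` in the Republican list and `hr6298` in the Democratic list). A minimal sketch of an order-preserving de-duplication step follows; the helper name `dedupe_preserving_order` is hypothetical.

```python
def dedupe_preserving_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each in place."""
    # dict keys are unique and keep insertion order
    return list(dict.fromkeys(urls))

example_urls = [
    "https://www.congress.gov/118/bills/hr3741/BILLS-118hr3741ih.xml",
    "https://www.congress.gov/118/bills/hr4672/BILLS-118hr4672ih.xml",
    "https://www.congress.gov/118/bills/hr3741/BILLS-118hr3741ih.xml",
]
unique_urls = dedupe_preserving_order(example_urls)
print(len(example_urls), "->", len(unique_urls))
```

The result could be passed to `scrape_and_combine_html` in place of the raw list. Doing so would change the counts reported in the outputs, so the cells would need to be re-run.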

#### Data Cleaning - Congressional Bills (Republican)

In [78]:
rep_congress_urls = [
    "https://www.congress.gov/118/bills/hr106/BILLS-118hr106ih.xml",
    "https://www.congress.gov/118/bills/hr7/BILLS-118hr7ih.xml",
    "https://www.congress.gov/118/bills/hr862/BILLS-118hr862ih.xml",
    "https://www.congress.gov/118/bills/hr1143/BILLS-118hr1143ih.xml",
    "https://www.congress.gov/118/bills/hr792/BILLS-118hr792ih.xml",
    "https://www.congress.gov/118/bills/hr330/BILLS-118hr330ih.xml",
    "https://www.congress.gov/118/bills/hr1470/BILLS-118hr1470ih.xml",
    "https://www.congress.gov/118/bills/hr384/BILLS-118hr384ih.xml",
    "https://www.congress.gov/118/bills/hr26/BILLS-118hr26pcs.xml",
    "https://www.congress.gov/118/bills/hr632/BILLS-118hr632ih.xml",
    "https://www.congress.gov/118/bills/hr1297/BILLS-118hr1297ih.xml",
    "https://www.congress.gov/118/bills/hr73/BILLS-118hr73ih.xml",
    "https://www.congress.gov/118/bills/hr416/BILLS-118hr416ih.xml",
    "https://www.congress.gov/118/bills/hr383/BILLS-118hr383ih.xml",
    "https://www.congress.gov/118/bills/hr5806/BILLS-118hr5806ih.xml",
    "https://www.congress.gov/118/bills/hr5319/BILLS-118hr5319ih.xml",
    "https://www.congress.gov/118/bills/hr6459/BILLS-118hr6459ih.xml",
    "https://www.congress.gov/118/bills/hr6460/BILLS-118hr6460ih.xml",
    "https://www.congress.gov/118/bills/hr3741/BILLS-118hr3741ih.xml",
    "https://www.congress.gov/118/bills/hr4672/BILLS-118hr4672ih.xml",
    "https://www.congress.gov/118/bills/hr983/BILLS-118hr983ih.xml",
    "https://www.congress.gov/118/bills/hr3741/BILLS-118hr3741ih.xml",
    "https://www.congress.gov/118/bills/hr421/BILLS-118hr421ih.xml",
    "https://www.congress.gov/118/bills/hr435/BILLS-118hr435ih.xml",
    "https://www.congress.gov/118/bills/hr175/BILLS-118hr175ih.xml",
    "https://www.congress.gov/118/bills/hr116/BILLS-118hr116ih.xml"
    # Add more bill XML URLs as needed
]

rep_congress_content = scrape_and_combine_html(rep_congress_urls)

# Print or use the combined content as needed
print(rep_congress_content)



118 HR 106 IH: Abortion Is Not Health Care Act of 2023 U.S. House of Representatives 2023-01-09 text/xml EN Pursuant to Title 17 Section 105 of the United States Code, this file is not subject to copyright protection and is in the public domain. I 118th CONGRESS 1st Session H. R. 106 IN THE HOUSE OF REPRESENTATIVES January 9, 2023 Mr. Biggs introduced the following bill; which was referred to the Committee on Ways and Means A BILL To amend the Internal Revenue Code of 1986 to provide that amounts paid for an abortion are not taken into account for purposes of the deduction for medical expenses. 1. Short title This Act may be cited as the Abortion Is Not Health Care Act of 2023 . 2. Amounts paid for abortion not taken into account in determining deduction for medical expenses (a) In general Section 213 of the Internal Revenue Code of 1986 is amended by adding at the end the following new subsection: (f) Amounts paid for abortion not taken into account An amount paid during the taxable y

#### VADER Sentiment for Congressional Bills (Right)

In [79]:
overall_sentiment(rep_congress_content)

{'neg': 0.069,
 'neu': 0.859,
 'pos': 0.072,
 'compound': -0.9975,
 'Overall sentiment': 'negative'}

#### Top Words Count for Congressional Bills (Right)

In [86]:
custom_stopwords = ['mr','abortion','section', 'title','act','subsections','shall','following','code','2023','subsection','mrs','may','house',
        'representatives','amended','bill']

vectorizer = CountVectorizer(stop_words=stopwords.words('english')+custom_stopwords)

# Fit and transform the content
word_matrix = vectorizer.fit_transform([rep_congress_content])

# Get feature names (words)
right_words = vectorizer.get_feature_names_out()

# Get word counts
word_counts = word_matrix.toarray()[0]

# Create a DataFrame with words and counts
word_df = pd.DataFrame({'word': right_words , 'count': word_counts})

# Sort DataFrame by count in descending order
word_df = word_df.sort_values(by='count', ascending=False)

# Print the top words
print(word_df.head(20))

           word  count
1466     states    103
786      health    101
152   abortions     97
371       child     89
1593     united     89
1537       term     73
1464      state     72
1651      woman     66
1159   physical     65
691     federal     62
1029      minor     60
1201  pregnancy     58
1272     public     57
1012      means     56
1144  performed     55
953        life     54
369    chemical     53
1590     unborn     53
1425   services     51
479    coverage     48


#### Data Cleaning - Congressional Bills (Democrat)

In [88]:
dem_congress_urls = [
    "https://www.congress.gov/118/bills/hr767/BILLS-118hr767ih.xml",
    "https://www.congress.gov/118/bills/hr2573/BILLS-118hr2573ih.xml",
    "https://www.congress.gov/118/bills/hr55/BILLS-118hr55ih.xml",
    "https://www.congress.gov/118/bills/hr4303/BILLS-118hr4303ih.xml",
    "https://www.congress.gov/118/bills/hr1723/BILLS-118hr1723ih.xml",
    "https://www.congress.gov/118/bills/hr561/BILLS-118hr561ih.xml",
    "https://www.congress.gov/118/bills/hr3132/BILLS-118hr3132ih.xml",
    "https://www.congress.gov/118/bills/hr12/BILLS-118hr12ih.xml",
    "https://www.congress.gov/118/bills/hr782/BILLS-118hr782ih.xml",
    "https://www.congress.gov/118/bills/hr1224/BILLS-118hr1224ih.xml",
    "https://www.congress.gov/118/bills/hr2736/BILLS-118hr2736ih.xml",
    "https://www.congress.gov/118/bills/hr6357/BILLS-118hr6357ih.xml",
    "https://www.congress.gov/118/bills/hr4796/BILLS-118hr4796ih.xml",
    "https://www.congress.gov/118/bills/hr4147/BILLS-118hr4147ih.xml",
    "https://www.congress.gov/118/bills/hr4268/BILLS-118hr4268ih.xml",
    "https://www.congress.gov/118/bills/hr4418/BILLS-118hr4418ih.xml",
    "https://www.congress.gov/118/bills/hr2907/BILLS-118hr2907ih.xml",
    "https://www.congress.gov/118/bills/hr445/BILLS-118hr445ih.xml",
    "https://www.congress.gov/118/bills/hr459/BILLS-118hr459ih.xml",
    "https://www.congress.gov/118/bills/hr4281/BILLS-118hr4281ih.xml",
    "https://www.congress.gov/118/bills/hr5008/BILLS-118hr5008ih.xml",
    "https://www.congress.gov/118/bills/hr3659/BILLS-118hr3659ih.xml",
    "https://www.congress.gov/118/bills/hr3421/BILLS-118hr3421ih.xml",
    "https://www.congress.gov/118/bills/hr6298/BILLS-118hr6298ih.xml",
    "https://www.congress.gov/118/bills/hr4901/BILLS-118hr4901ih.xml",
    "https://www.congress.gov/118/bills/hr3420/BILLS-118hr3420ih.xml", 
    "https://www.congress.gov/118/bills/hr4121/BILLS-118hr4121ih.xml",  
    "https://www.congress.gov/118/bills/hr62/BILLS-118hr62ih.xml",
    "https://www.congress.gov/118/bills/hr6298/BILLS-118hr6298ih.xml", 
    "https://www.congress.gov/118/bills/hr6270/BILLS-118hr6270ih.xml", 
    "https://www.congress.gov/118/bills/hr4329/BILLS-118hr4329ih.xml"

    # Add more bill XML URLs as needed
]

dem_congress_content = scrape_and_combine_html(dem_congress_urls)

# Print or use the combined content as needed
print(dem_congress_content)



118 HR 767 IH: Protecting Access to Medication Abortion Act of 2023 U.S. House of Representatives 2023-02-02 text/xml EN Pursuant to Title 17 Section 105 of the United States Code, this file is not subject to copyright protection and is in the public domain. I 118th CONGRESS 1st Session H. R. 767 IN THE HOUSE OF REPRESENTATIVES February 2, 2023 Ms. Bush (for herself, Ms. Omar , Mr. Soto , Ms. Jackson Lee , Ms. McCollum , Ms. Norton , Mr. Beyer , and Mr. Carson ) introduced the following bill; which was referred to the Committee on Energy and Commerce A BILL To preserve access to abortion medications. 1. Short title This Act may be cited as the Protecting Access to Medication Abortion Act of 2023 . 2. Modification of REMS (a) In general The Secretary of Health and Human Services (referred to in this section as the Secretary ) shall ensure that the risk evaluation and mitigation strategy under section 505–1 of the Federal Food, Drug, and Cosmetic Act ( 21 U.S.C. 355–1 ) that applies to m

In [89]:
overall_sentiment(dem_congress_content)

{'neg': 0.041,
 'neu': 0.831,
 'pos': 0.128,
 'compound': 1.0,
 'Overall sentiment': 'positive'}

In [91]:
custom_stopwords = ['mr','ms','including','act','secretary','title','section','subsection','shall','42','paragraph','described','may','information']

vectorizer = CountVectorizer(stop_words=stopwords.words('english')+custom_stopwords)

# Fit and transform the content
word_matrix = vectorizer.fit_transform([dem_congress_content])

# Get feature names (words)
dem_bill_words = vectorizer.get_feature_names_out()

# Get word counts
word_counts = word_matrix.toarray()[0]

# Create a DataFrame with words and counts
word_df = pd.DataFrame({'word': dem_bill_words, 'count': word_counts})

# Sort DataFrame by count in descending order
word_df = word_df.sort_values(by='count', ascending=False)

# Print the top words
print(word_df.head(20))

              word  count
2094        health   1133
925           care    840
3685      services    665
453       abortion    514
3805         state    378
3492  reproductive    315
3808        states    302
3270      provider    265
2262    individual    242
4142        united    240
2265   individuals    223
1715        entity    216
467         access    207
1882       federal    205
2002       general    196
3268       provide    179
3223       program    179
3976          term    176
3030        people    171
3271     providers    158


## 3. Methods


- Use Fightin' Words to examine partisan differences in word choice across news outlets (Lecture 10; course-provided library, as used in PSET 3).
- Use VADER on the presidential debates and congressional speech phrasings (VADER was introduced in PSET 1).
- Use k-means clustering of phrasings, applied to all corpora (PSET 2; Lectures 6 and 7).
- Use BERT classification (PSET 5; Lectures 15 and 16; article discussed).
- For CENTER outlets, use VADER to examine the connotation of the words used within each outlet, then compare against the Fightin' Words terms identified for left and right and examine how those words are distributed within that outlet.
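
To make the first method concrete, here is a rough sketch of the statistic behind Fightin' Words: a z-scored log-odds ratio with a Dirichlet prior. It is not the course-provided library. The function name `fightin_words_z`, the uniform prior, the `prior_strength` value, and the toy word lists are all assumptions made for illustration.

```python
import math
from collections import Counter

def fightin_words_z(counts_a, counts_b, prior_strength=0.01):
    """z-scored log-odds ratio per word; positive values favor corpus A."""
    vocab = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    a0 = prior_strength * len(vocab)  # total mass of the uniform prior
    scores = {}
    for word in vocab:
        y_a = counts_a.get(word, 0) + prior_strength
        y_b = counts_b.get(word, 0) + prior_strength
        log_odds_a = math.log(y_a / (n_a + a0 - y_a))
        log_odds_b = math.log(y_b / (n_b + a0 - y_b))
        variance = 1.0 / y_a + 1.0 / y_b  # approximate variance of the difference
        scores[word] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return scores

# Toy corpora standing in for left-leaning and right-leaning text
left = Counter("women choice health care women rights".split())
right = Counter("life unborn child life protect life".split())
z = fightin_words_z(left, right)
ranked = sorted(z, key=z.get)
print("most right-leaning:", ranked[0])
print("most left-leaning:", ranked[-1])
```

Unlike raw counts, the z-score discounts words that are rare in both corpora, which is why it is a better fit for comparing outlets of different sizes.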

## 4. Results

### DEADLINE: THURS 12/7 (Midnight)

## 5. Discussion and conclusions

### DEADLINE: 12/8