## Extract only hard news

For a project analyzing the evaluative content of quotes vs. non-quotes.

This notebook filters articles by URL and drops the ones labelled "first person", "analysis", or "opinion". It only works for CBC articles, whose URLs have a clear(ish) structure.

### Some notes on the method

The rule looks for "first-person" in the URL, either right after a slash (`/first-person`) or between hyphens (`-first-person-`). This catches most cases, where the structure is one of the following:

`63ba88f5f642cd45622d61fc,https://www.cbc.ca/news/canada/calgary/dave-cheke-first-person-hostage-1.6686436`

`63c3c395f642cd456271cb7f,https://www.cbc.ca/news/canada/first-person-winter-s-darkness-1.6702896`

There seem to be some first-person essays that do not have "first-person" in the URL. Those stay in the data, because searching for the string "first-person" in the article text would also exclude other, legitimate articles. As a result, the first article below, which is a first-person opinion piece, is kept even though it should ideally be removed. In exchange, the rule correctly keeps the second one, which is a legitimate news article.

Should be out (but the URL rule keeps it):

`63b40d36f642cd4562015142,https://www.cbc.ca/news/canada/manitoba/climate-crisis-should-i-have-kids-michaela-keegan-1.6653649,"CBC Manitoba's Creator Network asked gen-Zers and millennials to contemplate the choice to have children.
[...]
Should I Have Kids? is a new series that launches with this first-person essay by Michaela Keegan, a young Winnipegger contemplating her choice to have children or not.`

Should be in:

`3d58217f642cd4562f6943e,https://www.cbc.ca/news/world/tyre-nichols-memphis-police-beating-video-questions-1.6729731,
[...]
As Nichols is slumped up against a car, not one of the officers renders aid. The body camera footage shows a first-person view of one of them reaching down and tying his shoe.`
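
As a quick check of the first-person rule, here is a minimal sketch that applies the URL pattern to the four example URLs quoted in these notes. It uses plain `re` rather than pandas so it runs on its own; the names `FIRST_PERSON`, `urls`, and `flagged` are just for illustration.

```python
import re

# "first-person" right after a slash, or with a hyphen on both sides.
FIRST_PERSON = re.compile(r"/first-person|-first-person-")

urls = [
    "https://www.cbc.ca/news/canada/calgary/dave-cheke-first-person-hostage-1.6686436",
    "https://www.cbc.ca/news/canada/first-person-winter-s-darkness-1.6702896",
    "https://www.cbc.ca/news/canada/manitoba/climate-crisis-should-i-have-kids-michaela-keegan-1.6653649",
    "https://www.cbc.ca/news/world/tyre-nichols-memphis-police-beating-video-questions-1.6729731",
]

# True means the URL rule would remove the article.
flagged = [bool(FIRST_PERSON.search(url)) for url in urls]

for url, hit in zip(urls, flagged):
    print("remove" if hit else "keep  ", url)
```

The first two URLs are flagged and the last two are not, so the Manitoba essay slips through while the Tyre Nichols news story is correctly kept.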

For analysis, the string in the URL seems to be either `/analysis-` or `-analysis-`.

For opinion, the situation is slightly different. True opinion pieces seem to have `/opinion` in the URL. After checking many instances of `-opinion-`, I concluded that those stories are about opinion polls or public opinion, not opinion pieces. So the procedure only removes `/opinion`. For example, this story is kept:

`63c0551df642cd4562586cd4,https://www.cbc.ca/news/world/prince-harry-spare-sales-opinion-poll-1.6711510,`

`King Charles and Prince William made their first public appearances on Thursday since the release of Prince Harry's tell-all memoir — which is racking up sales but apparently hurting his once-strong popularity.`
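
The analysis and opinion rules can be sketched the same way. The poll URL below is the one quoted above. The other two URLs are hypothetical, made up only to show the `/opinion` and `/analysis-` shapes, and `is_removed` is an illustrative helper, not part of the notebook.

```python
import re

ANALYSIS = re.compile(r"/analysis-|-analysis-")
OPINION = re.compile(r"/opinion")  # deliberately not "-opinion-"


def is_removed(url: str) -> bool:
    """Return True if the URL looks like an analysis or opinion piece."""
    return bool(ANALYSIS.search(url) or OPINION.search(url))


# Real URL from the notes: a news story about an opinion poll.
poll_url = "https://www.cbc.ca/news/world/prince-harry-spare-sales-opinion-poll-1.6711510"

# Hypothetical URLs, invented for illustration only.
hypothetical_opinion_url = "https://www.cbc.ca/news/opinion/example-piece-1.0000000"
hypothetical_analysis_url = "https://www.cbc.ca/news/politics/analysis-example-piece-1.0000000"

for url in (poll_url, hypothetical_opinion_url, hypothetical_analysis_url):
    print("remove" if is_removed(url) else "keep  ", url)
```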

In [None]:
import pandas as pd
import numpy as np
import json
from pandas import json_normalize
import ast
from ast import literal_eval
import os
import glob

In [None]:
os.chdir(r'C:\Maite\MOD\projects\Monika_Bednarek\Evaluation_quotes\Data\CBC_input')

## One month at a time

In [None]:
df = pd.read_csv('2023_12_dec_CBC_hardnews.csv', encoding='utf8')

In [None]:
df

## Extract the right information

Filter out the rows that contain:
* `/first-person`
* `-first-person-`
* `/analysis-`
* `-analysis-`
* `/opinion`
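
The five fragments in this list could also be applied in a single pass. This is a minimal sketch, assuming the URL column is called `url` as in the cells in this notebook; `NOT_HARD_NEWS` and `keep_hard_news` are illustrative names, and the sample rows reuse URLs quoted in the notes above.

```python
import pandas as pd

# The five URL fragments from the list, joined into one regular expression.
NOT_HARD_NEWS = r"/first-person|-first-person-|/analysis-|-analysis-|/opinion"


def keep_hard_news(df: pd.DataFrame, url_col: str = "url") -> pd.DataFrame:
    """Return only the rows whose URL matches none of the fragments."""
    # na=False treats a missing URL as "no match", so that row is kept.
    is_excluded = df[url_col].str.contains(NOT_HARD_NEWS, regex=True, na=False)
    return df[~is_excluded]


sample = pd.DataFrame({"url": [
    "https://www.cbc.ca/news/canada/calgary/dave-cheke-first-person-hostage-1.6686436",
    "https://www.cbc.ca/news/canada/first-person-winter-s-darkness-1.6702896",
    "https://www.cbc.ca/news/world/tyre-nichols-memphis-police-beating-video-questions-1.6729731",
    "https://www.cbc.ca/news/world/prince-harry-spare-sales-opinion-poll-1.6711510",
]})

kept = keep_hard_news(sample)
print(len(sample), "rows in,", len(kept), "rows kept")
```

Filtering in three separate steps has one advantage over this: the intermediate data frames show how many rows each rule removes.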



In [None]:
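# na=False: a missing URL counts as "no match", so the row is kept and the mask stays purely boolean.
df_no_first = df[~df['url'].str.contains(r'/first-person|-first-person-', na=False)]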

In [None]:
df_no_first

In [None]:
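# Note: '/analysis' is slightly broader than the '/analysis-' listed above (it also matches, e.g., '/analysis/').
df_no_first_no_analysis = df_no_first[~df_no_first['url'].str.contains(r'/analysis|-analysis-', na=False)]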

In [None]:
df_no_first_no_analysis

In [None]:
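# Only '/opinion' is removed; '-opinion-' URLs are stories about polls or public opinion.
df_no_first_no_analysis_no_opinion = df_no_first_no_analysis[~df_no_first_no_analysis['url'].str.contains(r'/opinion', na=False)]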

In [None]:
df_no_first_no_analysis_no_opinion

## Remember to change file names here
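
One way to make the renaming less error-prone would be to derive the output name from the input name. This is only a sketch, assuming every monthly input ends in `_CBC_hardnews.csv` and every output should end in `_CBC_news.csv`, as with the December file in this notebook; `output_name` is a hypothetical helper.

```python
from pathlib import Path


def output_name(input_name: str) -> str:
    """Map e.g. '2023_12_dec_CBC_hardnews.csv' to '2023_12_dec_CBC_news.csv'."""
    path = Path(input_name)
    if not path.stem.endswith("_CBC_hardnews"):
        raise ValueError(f"Unexpected input file name: {input_name}")
    # Drop the trailing 'hardnews' and put 'news' in its place.
    return path.stem[: -len("hardnews")] + "news" + path.suffix


print(output_name("2023_12_dec_CBC_hardnews.csv"))
```

With this, only the input file name would need to change from month to month.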

In [None]:
# CHANGE THE NAME OF THE MONTH EACH TIME!

df_no_first_no_analysis_no_opinion.to_csv('2023_12_dec_CBC_news.csv', index=False)

## Concatenate all csvs - once I have produced all 12 months

Once I have all the months, concatenate them and count.
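
A minimal sketch of the concatenate-and-count step. It assumes the monthly outputs all follow the `*_CBC_news.csv` naming used for December, and it matches only those files, since the unfiltered `*_CBC_hardnews.csv` inputs sit in the same folder. `combine_monthly` and the `source_file` column are illustrative, and the demo files are tiny made-up stand-ins written to a temporary folder.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_monthly(folder: str, pattern: str = "*_CBC_news.csv") -> pd.DataFrame:
    """Concatenate the monthly files in `folder`, recording each row's source file."""
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    frames = [pd.read_csv(p).assign(source_file=os.path.basename(p)) for p in paths]
    return pd.concat(frames, ignore_index=True)


# Made-up demo data so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"url": ["a", "b"]}).to_csv(
        os.path.join(tmp, "2023_01_jan_CBC_news.csv"), index=False)
    pd.DataFrame({"url": ["c"]}).to_csv(
        os.path.join(tmp, "2023_02_feb_CBC_news.csv"), index=False)
    # An unfiltered input file, which the pattern should ignore.
    pd.DataFrame({"url": ["x"]}).to_csv(
        os.path.join(tmp, "2023_01_jan_CBC_hardnews.csv"), index=False)
    combined = combine_monthly(tmp)

per_file = combined["source_file"].value_counts().sort_index()
total = len(combined)
print(per_file)
print("Total articles:", total)
```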

In [None]:
extension = 'csv'

In [None]:
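# Only the filtered monthly outputs; a plain '*.csv' would also match the unfiltered '*_CBC_hardnews.csv' inputs in this folder.
all_filenames = sorted(glob.glob('*_CBC_news.{}'.format(extension)))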

In [None]:
# Combine all files in the list

CBC_news_all_2023 = pd.concat([pd.read_csv(f) for f in all_filenames], ignore_index=True)

In [None]:
CBC_news_all_2023

In [None]:
# Export to csv

CBC_news_all_2023.to_csv('CBC_news_all_2023.csv', index=False, encoding='utf-8-sig')