# North Korean News

Scrape the North Korean news agency http://kcna.kp

Save a CSV called `nk-news.csv`. This file should include:

* The **article headline**
* The value of **`onclick`** (they don't have normal links)
* The **article ID** (for example, the article ID for `fn_showArticle("AR0125885", "", "NT00", "L")` is `AR0125885`)

The last part is easiest using pandas. Be sure you don't save the index!

* _**Tip:** If you're using requests+BeautifulSoup, you can always look at `response.text` to see if the page looks like what you think it looks like_
* _**Tip:** Check your URL to make sure it is what you think it should be!_
* _**Tip:** Does it look different if you scrape with BeautifulSoup compared to if you scrape it with Selenium?_
* _**Tip:** For the last part, how do you pull out part of a string from a longer string?_
* _**Tip:** `expand=False` is helpful if you want to assign a single new column when extracting_
* _**Tip:** `(` and `)` mean something special in regular expressions, so you have to say "no really seriously I mean `(`" by using `\(` instead_
* _**Tip:** If your `.*` is matching too much, try `.*?` instead: rather than "take as much as possible," it means "take only as much as needed"_
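
The regex tips above can be tried on the sample string from the assignment before touching any scraped data. This is a minimal sketch using Python's built-in `re` module; the variable names `greedy` and `lazy` are illustrative only.

```python
import re

# Sample onclick value taken from the assignment text above
onclick = 'fn_showArticle("AR0125885", "", "NT00", "L")'

# \( matches a literal parenthesis.
# (.*) is greedy: it runs on to the LAST double quote in the string.
greedy = re.search(r'fn_showArticle\("(.*)"', onclick).group(1)

# (.*?) is lazy: it stops at the FIRST closing double quote.
lazy = re.search(r'fn_showArticle\("(.*?)"', onclick).group(1)

print(greedy)
print(lazy)
```

Only the lazy version returns the bare article ID; the greedy one also swallows the remaining arguments.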

In [7]:
import requests
from bs4 import BeautifulSoup

import re

import pandas as pd

response = requests.get('http://kcna.kp/kcna.user.home.retrieveHomeInfoList.kcmsf;jsessionid=986E6911A7A5C7099544024C67C74C1A')
# Name the parser explicitly so results don't depend on which parsers are installed
soup_doc = BeautifulSoup(response.text, 'html.parser')

In [5]:
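headlines = soup_doc.find_all('a', class_='titlebet')
for headline in headlines:
    print(headline.text)

# The same <a> tags carry the onclick attribute
onclicks = soup_doc.find_all('a', class_='titlebet')
for onclick in onclicks:
    # Use the loop variable here, not the leftover `headline` from the loop above
    print(onclick['onclick'])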

Committees for Election of Deputies to Province (Municipality), City (District) and County Assemblies Start Work in DPRK
Greetings to Seychellois President
Venezuelan City's Highest Order Awarded to Supreme Leader Kim Jong Un
Xi Jinping to Visit DPRK
Kim Jae Ryong Inspects Tokchon Area Coal-Mining Complex
Double-Dealing Tactics Will Never Work: KCNA Commentary
Central Election Guidance Committee Formed in DPRK
Congratulations to President of Kazakhstan
Greetings to Philippine President
KCNA Commentary Rebukes Japan's Attachment to "Flag of Rising Sun Shedding Rays"
Results of 2018-2019 DPRK Premier Football League (15)
Results of 2018-2019 DPRK Women's Premier Football League (13)
Results of AFC Cup
Results of 2018-2019 DPRK Premier Football League (14)
Road Relay Held in DPRK
Pyongyang Int'l Sci-Tech Exhibition of Health and Medical Appliances Closes
Pyongyang International Sci-Tech Exhibition of Health and Medical Appliances Opens
Tour by Super-light Planes Popular in DPRK
MOU Signed

In [9]:
rows = []

for headline in headlines:
    
    row = {}
    row['Headline'] = headline.text
    try:
        row['Onclick value'] = headline['onclick']
    except KeyError:
        # A link without an onclick attribute leaves this value missing
        pass
    
    rows.append(row)

df = pd.DataFrame(rows)
df

Unnamed: 0,Headline,Onclick value
0,Committees for Election of Deputies to Provinc...,"fn_showArticle(""AR0126485"", """", ""NT21"", ""L"")"
1,Greetings to Seychellois President,"fn_showArticle(""AR0126295"", """", ""NT21"", ""L"")"
2,Venezuelan City's Highest Order Awarded to Sup...,"fn_showArticle(""AR0126287"", """", ""NT21"", ""L"")"
3,Xi Jinping to Visit DPRK,"fn_showArticle(""AR0126286"", """", ""NT21"", ""L"")"
4,Kim Jae Ryong Inspects Tokchon Area Coal-Minin...,"fn_showArticle(""AR0126281"", """", ""NT21"", ""L"")"
5,Double-Dealing Tactics Will Never Work: KCNA C...,"fn_showArticle(""AR0126277"", """", ""NT21"", ""L"")"
6,Central Election Guidance Committee Formed in ...,"fn_showArticle(""AR0126169"", """", ""NT21"", ""L"")"
7,Congratulations to President of Kazakhstan,"fn_showArticle(""AR0126127"", """", ""NT21"", ""L"")"
8,Greetings to Philippine President,"fn_showArticle(""AR0126118"", """", ""NT21"", ""L"")"
9,KCNA Commentary Rebukes Japan's Attachment to ...,"fn_showArticle(""AR0126117"", """", ""NT21"", ""L"")"
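
The table above has the headline and the `onclick` value, but the assignment also asks for the article ID. One possible way to add it is pandas' `.str.extract` with a single capture group. The sketch below rebuilds a two-row stand-in for `df` from values shown in the output so that it runs on its own; the column name `Article ID` is an assumption, not something the assignment specifies.

```python
import pandas as pd

# Two rows copied from the output above, standing in for the scraped df
df = pd.DataFrame({
    'Headline': [
        'Greetings to Seychellois President',
        'Xi Jinping to Visit DPRK',
    ],
    'Onclick value': [
        'fn_showArticle("AR0126295", "", "NT21", "L")',
        'fn_showArticle("AR0126286", "", "NT21", "L")',
    ],
})

# Capture the first quoted argument of fn_showArticle(...).
# expand=False returns a Series, so it can be assigned to one new column.
df['Article ID'] = df['Onclick value'].str.extract(
    r'fn_showArticle\("(.*?)"', expand=False
)

print(df[['Headline', 'Article ID']])
```

In the notebook itself, only the `str.extract` assignment would be needed, applied to the real `df` before saving.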


In [11]:
df.to_csv("nk-news.csv", index=False)
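
To confirm the index really was left out, the file can be read back and its columns inspected. A small sketch, assuming a one-row stand-in DataFrame and a temporary directory so it does not overwrite the real `nk-news.csv`:

```python
import os
import tempfile

import pandas as pd

# One-row stand-in using a headline and ID that appear in the output above
df = pd.DataFrame({
    'Headline': ['Xi Jinping to Visit DPRK'],
    'Article ID': ['AR0126286'],
})

# Write to a temporary location rather than the working directory
path = os.path.join(tempfile.mkdtemp(), 'nk-news.csv')
df.to_csv(path, index=False)

# If the index had been saved, an extra unnamed column would show up here
check = pd.read_csv(path)
print(check.columns.tolist())
```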