***

# Data science for social scientists
## A friendly introduction to some powerful tools


***

<img src="http://i.imgur.com/0QZUW.jpg" width="600" height="600" alt="mario">

***

## Content mini-workshop:

* **Demonstration: data science applied to my own research**
* **Lesson 1:     regular expressions**
* **Exercise 1:    develop your own regex pattern and erase centuries from a text**
* **Create a dataset: run the demonstration code yourself so we have data for the following steps**
* **Lesson 2:     using Pandas for your SQL jobs and making more data from data**
* **Exercise 2:**

***

***

## Demonstration: splitting and cleaning 'raw' newspaper articles

Split batches of 200 articles downloaded from Lexis Nexis into separate .txt files


***

***

## Lesson 1: regular expressions

Regular expressions (regex for short) are very helpful for cleaning and organizing data. For example, you can use them to delete every digit or every alphabetic character in a dataset, or to search for postal codes, telephone numbers, email addresses, specific words, or sentences. In each case, the regex finds a particular sequence of characters so that you can then do something with it.
<br />
<br />
To run the code below, click in the cell and hit Ctrl-Enter. 
To create a new cell, go to Insert in the menu above. Or click in an existing cell, press Esc, and then B to create a new cell below. 
See 'More resources' at the bottom of this Notebook for other shortcuts.

***

In [1]:
# Before you start, specify in which directory you work
# Use the folder where the data (allarticles_merged.txt) is stored on your computer. 
# Don't put anything else in this folder.

import os
os.chdir('C:/Users/renswilderom/Documents/test')
path = 'C:/Users/renswilderom/Documents/test'

In [2]:
# This example uses regex to split a text in two groups
# It is based on a regex tutorial below (see 'More resources')

import re

line = "For example, we are only interested in the text after a certain key word, say Mario. This information is super relevant."

match = re.match(r'(.*)Mario\.?(.*)', line) 
# '\.' matches a literal full stop (an unescaped '.' would match any character).
# The '?' makes it optional, so the pattern will match both 'Mario' and 'Mario.'

if match:
   print("Mario is found! \n") 
   print("Group 1: \n", match.group(1))  
   print("Group 2: \n", match.group(2))     

Mario is found! 

Group 1: 
 For example, we are only interested in the text after a certain key word, say 
Group 2: 
  This information is super relevant.


In [3]:
# rather than grouping your data, you can also just delete the irrelevant parts:

import re # like in the cell above, first import the regex module

line = "For example, we are only interested in the text after a certain key word, say Mario. This information is super relevant."

print(re.sub(r'(.*?)Mario\.', '', line)) # this pattern matches 'Mario.' and anything before it.

 This information is super relevant.


In [4]:
# you can also make a new value (or variable) called 'new_line'

import re 

line = "For example, we are only interested in the text after a certain key word, say Mario. This information is super relevant."

new_line = re.sub(r'(.*?)Mario\.', '', line) # this pattern matches 'Mario.' and anything before it.

print(new_line)

 This information is super relevant.
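
The cells above use both `(.*)` and `(.*?)`. The difference only shows when the key word occurs more than once, so here is a small sketch with a made-up sentence (the sentence and the variable names `greedy` and `lazy` are assumptions for illustration, not part of the workshop data).

```python
import re

# Hypothetical example line in which the key word occurs twice
line = "Luigi waits. Mario runs. Mario jumps."

# (.*) is 'greedy': it grabs as much text as possible, so it stops at the LAST 'Mario'
greedy = re.match(r'(.*)Mario(.*)', line)

# (.*?) is 'lazy': it grabs as little text as possible, so it stops at the FIRST 'Mario'
lazy = re.match(r'(.*?)Mario(.*)', line)

print("Greedy group 1:", repr(greedy.group(1)))
print("Greedy group 2:", repr(greedy.group(2)))
print("Lazy group 1:  ", repr(lazy.group(1)))
print("Lazy group 2:  ", repr(lazy.group(2)))
```

With only one 'Mario' in the line, as in the cells above, both versions give the same result.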


***

Instead of searching for specific words, like 'Mario', it's also possible to search for any string consisting of five characters, or for anything that looks like an email address. For this you can use special regex codes. For example, `\d` matches any digit, `\D` matches any non-digit, and `\w` matches any alphabetic character, digit, or underscore. 
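
As a rough sketch of those two searches: the text and the email addresses below are made up, and the email pattern is deliberately simple, so treat it as a starting point rather than a complete solution.

```python
import re

# Hypothetical sample text; the addresses are made up for illustration
text = "Contact anna@example.org or the desk at press.office@example.com before 12345 arrives."

# A simple pattern for things that look like an email address:
# some characters, an '@', and a domain with at least one dot.
# Real-world addresses can be more complicated than this.
emails = re.findall(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+', text)

# \b marks a word boundary, so \b\w{5}\b only matches 'words' of exactly five characters
five_chars = re.findall(r'\b\w{5}\b', text)

print(emails)
print(five_chars)
```

`re.findall` returns a list with every match, which is handy when you want to collect items instead of deleting them.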

***

In [5]:
# Here is an example of how to use them:

line = "4444 55555 six seven"

print(re.sub(r'\d', '', line)) # yields everything but digits
print(re.sub(r'(?!\s)\D', '', line)) # yields digits and spaces (the 'negative lookahead' (?!\s) stops \D from matching whitespace)
print(re.sub(r'\w{5}', '', line)) # removes every run of five word characters, here '55555' and 'seven'

  six seven
4444 55555  
4444  six 


***
See also this comprehensive cheat sheet: http://www.rexegg.com/regex-quickstart.html
<br />
And here are the basic codes:
<br />
\d  <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Matches digit  <br />
\D  <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Matches non-digit  <br />
\s  <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Matches whitespace  <br />
\S  <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Matches non-whitespace  <br />
\w  <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Matches alphanumeric  <br />
\W  <br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Matches non-alphanumeric <br />
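
A small sketch to try the basic codes yourself. The sample sentence is made up, and the `+` after a code simply means 'one or more in a row'.

```python
import re

# Hypothetical sample sentence
sample = "Room 12, floor 3!"

digits = re.findall(r'\d+', sample)       # runs of digits
non_digits = re.findall(r'\D+', sample)   # runs of anything that is not a digit
spaces = re.findall(r'\s', sample)        # single whitespace characters
chunks = re.findall(r'\S+', sample)       # runs of non-whitespace
words = re.findall(r'\w+', sample)        # runs of letters, digits, or underscores
symbols = re.findall(r'\W', sample)       # single characters that are none of those

for name, result in [('\\d+', digits), ('\\D+', non_digits), ('\\s', spaces),
                     ('\\S+', chunks), ('\\w+', words), ('\\W', symbols)]:
    print(name, result)
```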

***

***

## Exercise 1: regular expressions

Please complete exercises A, B, and C below.

***

In [6]:
# A: print the line without any references to centuries (e.g. 16th century needs to be omitted from the line) 
# B: print without punctuation (use the world wide web to find out how).
# C: Do both A and B at once and save the output as a .txt file on your computer (use the world wide web to find out how).

# And I will be there to help!

import re

# from Wikipedia on the history of pizza
line = "The innovation that led to flat bread pizza was the use of tomato as a topping. For some time after the tomato was brought to Europe from the Americas in the 16th century, it was believed by many Europeans to be poisonous (as some other fruits of the nightshade family are). However, by the late 18th century, it was common for the poor of the area around Naples to add tomato to their yeast-based flat bread, and so the pizza began.[citation needed] The dish gained popularity, and soon pizza became a tourist attraction as visitors to Naples ventured into the poorer areas of the city to try the local specialty. Antica Pizzeria Port'Alba in Naples Until about 1830, pizza was sold from open-air stands and out of pizza bakeries, and pizzerias keep this old tradition alive today. It is possible to enjoy paper-wrapped pizza and a drink sold from open-air stands outside the premises. Antica Pizzeria Port'Alba in Naples is widely regarded as the city's first pizzeria.[21] Purists, like the famous pizzeria 'Da Michele' in Via C. Sersale (founded 1870),[22] consider there to be only two true pizzas—the marinara and the margherita—and that is all they serve. Bamberger, David; Eban, Abba Solomon (1979). My People: Abba Eban's History of the Jews, Volume 2. Behrman House. p. 228. ISBN 0874412803. ‘Food and Drink - Pide - HiTiT Turkey guide’ Hitit.co.uk. Retrieved 2009-06-05. ‘History of Pizza Margherita’. tobetravelagent.com. 2012-04-09. Retrieved 2012-04-09."



***

## Create a dataset: run the demonstration code yourself so we have data for the following steps


The loop below uses `(.*\d+\sof\s\d+\sDOCUMENTS)` to search for the delimiter that appears between newspaper articles (e.g. '1 of 200 DOCUMENTS'). Each time it finds one, the code opens a new file whose name starts with 'article' and writes the lines that follow to it. 
<br />
<br />
I merged 15 of these 'batches' of Lexis Nexis articles together in one .txt file (allarticles_merged.txt). 
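
To see the idea without touching any files, here is a minimal sketch that applies the same delimiter pattern to a tiny in-memory text. The three 'articles' are made up, and the names `merged` and `articles` are assumptions for this illustration only.

```python
import re

# Hypothetical miniature of a merged Lexis Nexis file (the article texts are made up)
merged = (
    "1 of 3 DOCUMENTS\nFirst article text.\n"
    "2 of 3 DOCUMENTS\nSecond article text.\n"
    "3 of 3 DOCUMENTS\nThird article text.\n"
)

delimiter = re.compile(r'(.*\d+\sof\s\d+\sDOCUMENTS)')

articles = []
for line in merged.splitlines():
    if delimiter.search(line):
        articles.append([])          # a delimiter line starts a new article
    if articles:                     # skip anything that comes before the first delimiter
        articles[-1].append(line)

for number, article in enumerate(articles, start=1):
    print(number, article)
```

The workshop code follows the same logic, but writes each article to its own .txt file instead of a list.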


***


In [7]:
import os
import re

# Matches the Lexis Nexis delimiter line, e.g. '1 of 200 DOCUMENTS'
delimiter = re.compile(r'(.*\d+\sof\s\d+\sDOCUMENTS)')

with open("allarticles_merged.txt", "r") as text_file:
    lines = text_file.readlines()

k = 1  # line counter; it is also used to number the output files
target = open("article" + str(k) + ".txt", "a")

for line in lines:
    k += 1
    line = line.lstrip()  # removes leading blank spaces; blank lines become empty
    if delimiter.search(line):
        # A delimiter marks the start of a new article:
        # close the current file and continue in a new one
        target.close()
        target = open("article" + str(k) + ".txt", "a")
    target.write(line)

target.close()
# The merged file is deleted so that the next cell only finds the article files.
# This removes it from your computer, so make a backup before running the code.
os.remove("allarticles_merged.txt")
print("I'm done with splitting the files!")
print("The dataset is ready")

I'm done with splitting the files!
The dataset is ready


In [9]:
# Write all the articles (stored as separate .txt files) to one csv file 

# https://stackoverflow.com/questions/41913147/combine-a-folder-of-text-files-into-a-csv-with-each-content-in-a-cell

import csv
from pathlib import Path

with open('newspaper_articles.csv', 'w', encoding='UTF-8', newline='') as out_file:
    csv_out = csv.writer(out_file)
    csv_out.writerow(['FileName', 'Content'])
    for fileName in Path('.').glob('*.txt'):
        lines = [ ]
        with open(str(fileName.absolute()),'rb') as one_text:
            for line in one_text.readlines():
                lines.append(line.decode(encoding='UTF-8',errors='ignore').strip())
        csv_out.writerow([str(fileName),' '.join(lines)])
print("You now have a csv dataset, it's in your working directory")

You now have a csv dataset, it's in your working directory


In [10]:
# Open the csv again as a Pandas 'data frame' 
# index_col=0 uses the first column (FileName) as the row index

import pandas as pd
df = pd.read_csv("newspaper_articles.csv", index_col=0, encoding='UTF-8')
df.shape
df[:10]

FileName,Content
article1.txt,BYLINE: From MARK TRAN LENGTH: 224 words DATEL...
article100044.txt,88 of 200 DOCUMENTS The Guardian (London) May ...
article100080.txt,89 of 200 DOCUMENTS The Guardian (London) Apri...
article100447.txt,90 of 200 DOCUMENTS The Guardian (London) Apri...
article100510.txt,91 of 200 DOCUMENTS The Guardian (London) Apri...
article1011.txt,13 of 173 DOCUMENTS The Guardian (London) Nove...
article101215.txt,92 of 200 DOCUMENTS The Guardian (London) Apri...
article101282.txt,93 of 200 DOCUMENTS The Guardian (London) Apri...
article101346.txt,94 of 200 DOCUMENTS The Guardian (London) Apri...
article101425.txt,95 of 200 DOCUMENTS The Guardian (London) Apri...


In [11]:
# If you'd like to experiment more outside this class, it's possible to use regex to clean the articles inside the data frame.
# Here is an example:
# remove all words written entirely in capital letters from the column 'Content'.
df['Content'] = df['Content'].str.replace(r'\b[A-Z]+\b', '', regex=True)
df[:10]

FileName,Content
article1.txt,: From : 224 words : America's already blea...
article100044.txt,"88 of 200 The Guardian (London) May 10, 1996 ..."
article100080.txt,"89 of 200 The Guardian (London) April 27, 199..."
article100447.txt,"90 of 200 The Guardian (London) April 27, 199..."
article100510.txt,"91 of 200 The Guardian (London) April 26, 199..."
article1011.txt,"13 of 173 The Guardian (London) November 11, ..."
article101215.txt,"92 of 200 The Guardian (London) April 26, 199..."
article101282.txt,"93 of 200 The Guardian (London) April 25, 199..."
article101346.txt,"94 of 200 The Guardian (London) April 19, 199..."
article101425.txt,"95 of 200 The Guardian (London) April 16, 199..."
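
Regex can also pull information out of an article instead of removing it, which is one way to make more data from data. The sketch below is only an illustration: the article text is made up (modelled on the truncated rows shown above), and the names `position`, `length` and `record` are assumptions.

```python
import re

# Hypothetical article text, modelled on the layout of the rows above
content = "1 of 200 DOCUMENTS The Example Times January 1, 2000 LENGTH: 350 words Some article text."

# Capture the two numbers in the delimiter and the number in the LENGTH field
position = re.search(r'(\d+)\sof\s(\d+)\sDOCUMENTS', content)
length = re.search(r'LENGTH:\s(\d+)\swords', content)

record = {
    'position': int(position.group(1)),
    'batch_size': int(position.group(2)),
    'length_in_words': int(length.group(1)),
}

print(record)
```

In real articles a field may be missing, in which case `re.search` returns `None`, so check the result before calling `.group()`.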


***

## More resources:

#### Notebook shortcuts: https://www.cheatography.com/weidadeyue/cheat-sheets/jupyter-notebook/pdf_bw/
#### Regex tutorial: https://www.tutorialspoint.com/python/python_reg_expressions.htm
#### Working with Markdown: http://datascience.ibm.com/blog/markdown-for-jupyter-notebooks-cheatsheet/

