## Import & Clean Document and Count Tokens

#### In this example, we will be doing the following:

- Importing 2 documents
- Cleaning 2 documents
  * Lowercasing all text
  * Creating list of all words
  * Removing punctuation
  * Removing stop words
  * Creating a new list of tokens
- Counting occurrences of all tokens
- Converting this list to a dataframe

### Set-up

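Import the packages we will be using. The `docx` module is provided by the `python-docx` package, and the NLTK stop-word list has to be downloaded once (for example with `nltk.download('stopwords')`) before `stopwords.words('english')` can be used.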

In [26]:
import re, string
import pandas as pd
import numpy as np
from collections import Counter
from docx import Document
from nltk.corpus import stopwords

### Import Documents

Create a function that reads a Word document and joins the text of its paragraphs into a single string

In [3]:
def getText(filename):
    # open the Word file and collect the text of each paragraph
    doc = Document(filename)
    fullText = [para.text for para in doc.paragraphs]
    # join the paragraphs into one newline-separated string
    return '\n'.join(fullText)

Import both articles and save them as variables

In [33]:
full_doc1 = getText('SGM_Doc1_When_Airbnb_Launched.docx')

In [35]:
full_doc2 = getText('SGM_Doc2_Why_Rewards_for.docx')

Preview document

In [34]:
full_doc1

'When Airbnb launched in 2008, the tech start-up offered cash-strapped travelers cheap overnight stays on sofas and in spare bedrooms around the world.\n\nWhat a difference 11 years makes.\n\nWith the introduction of a new rental tier aptly called Airbnb Luxe, the company’s accommodation options, which had already expanded to include entire houses and even some upscale ones at that, now include the high-end market.\n\n“We have an overall strategy of having a product for every traveler, and Luxe is for the ones seeking luxury,” said Brian Chesky, Airbnb’s chief executive and co-founder.\n\nThe Luxe portfolio of 2,000 homes, available starting Tuesday, includes villas in Tuscany, ski lodges in New Zealand and castles in the French countryside. They were selected from the 5,000 properties listed on Luxury Retreats, a high-end vacation rental company that Airbnb acquired in 2017.\n\nThe customers for these homes, Mr. Chesky said, are the same guests who rented sofas with the company when t

In [36]:
full_doc2

'Why Rewards for Loyal Spenders Are ‘a Honey Pot for Hackers’\nThe punch cards stuffed in your wallet know next to nothing about you, except maybe how many frozen yogurts you still need to buy to get a free one.\n\nBut loyalty programs, as they shift from paper and plastic to apps and websites, are increasingly tracking a currency that can be more valuable than how much you spend: personal data. As a result, the programs know things about you that some of your friends may not, like your favorite flavor (mango), when your cravings strike (early afternoon) and how you pay (with your Visa), in addition to billing details and contact information.\n\nHackers are in close pursuit.\n\nOne loyalty-fraud prevention group estimates, conservatively, that $1 billion a year is lost to crime related to the programs. As a share of fraud not involving a physical payment card, such schemes more than doubled from 2017 to 2018, according to the Javelin Strategy & Research firm.\n\nSome criminals use stol

### Clean Documents

Create a list of words from each article, lowercasing the text and removing punctuation and stop words (modified from Paul's code)

In [28]:
def clean_token(doc):
    # lowercase all words
    doc = doc.lower()
    # split document into individual words on whitespace
    tokens = doc.split()
    # remove ASCII punctuation from each word
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [re_punc.sub('', w) for w in tokens]
    # remove stop words, and drop tokens left empty after punctuation removal
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w and w not in stop_words]
    return tokens

In [38]:
word_list1 = clean_token(full_doc1)

In [39]:
word_list2 = clean_token(full_doc2)
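
To see what this kind of cleaning does to a single sentence, the sketch below applies the same sequence of steps to a short string. It is only an illustration under stated assumptions: `clean_token_sketch` is a hypothetical name, and `SAMPLE_STOP_WORDS` is a small hand-picked stand-in for NLTK's English list so that the block runs without the corpus download. The sketch also drops tokens that become empty once punctuation is removed. One limitation it makes visible: `string.punctuation` covers only ASCII characters, so the curly apostrophes and quotes that appear in both articles are not stripped.

```python
import re
import string

# Assumption: a tiny stand-in for stopwords.words('english'),
# used here only so the sketch runs without downloading the NLTK corpus.
SAMPLE_STOP_WORDS = {"the", "in", "a", "and", "is", "for"}

def clean_token_sketch(doc, stop_words=SAMPLE_STOP_WORDS):
    # lowercase and split on whitespace
    tokens = doc.lower().split()
    # strip ASCII punctuation from each word
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [re_punc.sub('', w) for w in tokens]
    # drop empty strings and stop words
    return [w for w in tokens if w and w not in stop_words]

sample = "The Luxe portfolio, in Airbnb’s words, is for luxury."
sample_tokens = clean_token_sketch(sample)
print(sample_tokens)
```

The commas and the final period are removed, but the token `airbnb’s` keeps its curly apostrophe. If that matters for the counts, the character class could be extended with the specific Unicode punctuation found in the documents.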

### Count Tokens

In [40]:
token_count1 = Counter(word_list1).most_common()

In [41]:
token_count2 = Counter(word_list2).most_common()
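
`Counter(...).most_common()` returns a list of `(token, count)` tuples ordered from most to least frequent; tokens with equal counts keep the order in which they were first seen. A minimal sketch with a made-up token list (the tokens and counts below are illustrative, not taken from the articles):

```python
from collections import Counter

# Assumption: a hand-written token list standing in for word_list1
sample_tokens = ["luxe", "homes", "luxe", "airbnb", "homes", "luxe"]

# all tokens, most frequent first
sample_counts = Counter(sample_tokens).most_common()
print(sample_counts)  # [('luxe', 3), ('homes', 2), ('airbnb', 1)]

# passing n limits the result to the n most frequent tokens
top_two = Counter(sample_tokens).most_common(2)
print(top_two)  # [('luxe', 3), ('homes', 2)]
```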

### Convert to Dataframe

In [42]:
tokens_df1 = pd.DataFrame(token_count1, columns=["token", "count"]) 

In [43]:
tokens_df2 = pd.DataFrame(token_count2, columns=["token", "count"]) 
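
Because each element of the counted list is a `(token, count)` tuple, `pd.DataFrame` turns every tuple into one row and the `columns` argument names the two fields. A small sketch, assuming a made-up list of counts rather than the articles' real ones:

```python
import pandas as pd

# Assumption: illustrative (token, count) pairs, already sorted by count
sample_counts = [("luxe", 3), ("homes", 2), ("airbnb", 1)]

# one row per tuple; the default integer index preserves the list order
sample_df = pd.DataFrame(sample_counts, columns=["token", "count"])

# head(n) shows only the first n rows, i.e. the n most frequent tokens
top_df = sample_df.head(2)
print(top_df)
```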

In [44]:
tokens_df1

| | token | count |
|---|---|---|
| 0 | night | 6 |
| 1 | airbnb | 5 |
| 2 | luxe | 5 |
| 3 | said | 5 |
| 4 | chesky | 4 |
| 5 | homes | 4 |
| 6 | new | 3 |
| 7 | include | 3 |
| 8 | includes | 3 |
| 9 | luxury | 3 |

In [45]:
tokens_df2

| | token | count |
|---|---|---|
| 0 | said | 7 |
| 1 | loyalty | 5 |
| 2 | programs | 5 |
| 3 | year | 5 |
| 4 | account | 5 |
| 5 | security | 5 |
| 6 | one | 4 |
| 7 | data | 4 |
| 8 | hilton | 4 |
| 9 | information | 3 |
