# Preparing textual data for statistics and machine learning

1. Importing the dataset
2. Cleaning the dataset
3. Tokenization
4. Feature extraction on a large dataset



## Importing Data
Reddit Self-Posts dataset, available on Kaggle

In [1]:
import pandas as pd

In [2]:
posts_file = "rspct.tsv"

In [3]:
posts_df = pd.read_csv(posts_file, sep='\t')

In [4]:
posts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1013000 entries, 0 to 1012999
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   id         1013000 non-null  object
 1   subreddit  1013000 non-null  object
 2   title      1013000 non-null  object
 3   selftext   1013000 non-null  object
dtypes: object(4)
memory usage: 30.9+ MB


In [5]:
posts_df.head()

Unnamed: 0,id,subreddit,title,selftext
0,6d8knd,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, fi..."
1,58mbft,teenmom,"So what was Matt ""addicted"" to?",Did he ever say what his addiction was or is h...
2,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...
3,6ti6re,ringdoorbell,"Not door bell, but floodlight mount height.",I know this is a sub for the 'Ring Doorbell' b...
4,77sxto,intel,Worried about my 8700k small fft/data stress r...,"Prime95 (regardless of version) and OCCT both,..."


In [5]:
subred_file = "subreddit_info.csv"
subred_df=pd.read_csv(subred_file).set_index(['subreddit'])

In [6]:
subred_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3394 entries, whatsthatbook to Glitch_in_the_Matrix
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   category_1            3394 non-null   object
 1   category_2            3362 non-null   object
 2   category_3            536 non-null    object
 3   in_data               3394 non-null   bool  
 4   reason_for_exclusion  2381 non-null   object
dtypes: bool(1), object(4)
memory usage: 135.9+ KB


In [7]:
subred_df.head()

subreddit,category_1,category_2,category_3,in_data,reason_for_exclusion
whatsthatbook,advice/question,book,,True,
CasualConversation,advice/question,broad,,False,too_broad
Clairvoyantreadings,advice/question,broad,,False,too_broad
DecidingToBeBetter,advice/question,broad,,False,too_broad
HelpMeFind,advice/question,broad,,False,too_broad


In [9]:
subred_df.loc['Harley']

category_1                        autos
category_2              harley davidson
category_3                          NaN
in_data                            True
reason_for_exclusion                NaN
Name: Harley, dtype: object

In [8]:
df=posts_df.join(subred_df, on ='subreddit')

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1013000 entries, 0 to 1012999
Data columns (total 9 columns):
 #   Column                Non-Null Count    Dtype 
---  ------                --------------    ----- 
 0   id                    1013000 non-null  object
 1   subreddit             1013000 non-null  object
 2   title                 1013000 non-null  object
 3   selftext              1013000 non-null  object
 4   category_1            1013000 non-null  object
 5   category_2            1013000 non-null  object
 6   category_3            136000 non-null   object
 7   in_data               1013000 non-null  bool  
 8   reason_for_exclusion  0 non-null        object
dtypes: bool(1), object(8)
memory usage: 62.8+ MB


In [10]:
df.head()

Unnamed: 0,id,subreddit,title,selftext,category_1,category_2,category_3,in_data,reason_for_exclusion
0,6d8knd,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, fi...",writing/stories,tech support,,True,
1,58mbft,teenmom,"So what was Matt ""addicted"" to?",Did he ever say what his addiction was or is h...,tv_show,teen mom,,True,
2,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...,autos,harley davidson,,True,
3,6ti6re,ringdoorbell,"Not door bell, but floodlight mount height.",I know this is a sub for the 'Ring Doorbell' b...,hardware/tools,doorbells,,True,
4,77sxto,intel,Worried about my 8700k small fft/data stress r...,"Prime95 (regardless of version) and OCCT both,...",electronics,cpu,intel,True,
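To make the behaviour of `join(..., on='subreddit')` easier to see, here is a minimal sketch on two toy frames. The rows and the names `posts`, `info` and `joined` are made up for illustration and are not taken from the dataset. The `on` column of the left frame is matched against the index of the right frame, and the default left join keeps every post.

```python
import pandas as pd

# Hypothetical toy data, not rows from rspct.tsv
posts = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "subreddit": ["Harley", "intel", "unknown_sub"],
})
info = pd.DataFrame({
    "subreddit": ["Harley", "intel"],
    "category_1": ["autos", "electronics"],
}).set_index("subreddit")

# Left join: match posts['subreddit'] against the index of info
joined = posts.join(info, on="subreddit")
print(joined)

# Posts whose subreddit has no metadata get NaN in the joined columns
print(joined["category_1"].isna().sum())
```

In the real data every subreddit has metadata, which is why `category_1` shows no missing values after the join.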


### Standardizing Attributes Names

Usual practice:
- **df**: name of the dataset
- **text**: name of the column containing text to analyze

In [11]:
print(df.columns)

Index(['id', 'subreddit', 'title', 'selftext', 'category_1', 'category_2',
       'category_3', 'in_data', 'reason_for_exclusion'],
      dtype='object')


#### Renaming columns

- selftext renamed as text
- category_1 renamed as category
- category_2 renamed as subcategory
- category_3, in_data and reason_for_exclusion are dropped (incomplete data)

In [12]:
column_mapping = {
    'id':'id',
    'subreddit':'subreddit',
    'title':'title',
    'selftext':'text',
    'category_1':'category',
    'category_2':'subcategory',
    'category_3': None,
    'in_data': None,
    'reason_for_exclusion': None
}

In [13]:
column_mapping['selftext']

'text'

In [14]:
columns=[c for c in column_mapping.keys() if column_mapping[c] is not None]

In [15]:
print(columns)

['id', 'subreddit', 'title', 'selftext', 'category_1', 'category_2']


In [16]:
df=df[columns].rename(columns=column_mapping)

In [17]:
print(df.columns)

Index(['id', 'subreddit', 'title', 'text', 'category', 'subcategory'], dtype='object')


### Selection of data for the autos category

In [18]:
df=df[df['category']=='autos']

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20000 entries, 2 to 1012979
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           20000 non-null  object
 1   subreddit    20000 non-null  object
 2   title        20000 non-null  object
 3   text         20000 non-null  object
 4   category     20000 non-null  object
 5   subcategory  20000 non-null  object
dtypes: object(6)
memory usage: 1.1+ MB


In [20]:
df.head()

Unnamed: 0,id,subreddit,title,text,category,subcategory
2,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...,autos,harley davidson
56,5s0q8r,Mustang,Roush vs Shleby GT500,"I am trying to determine which is faster, and ...",autos,ford
78,5z3405,Volkswagen,2001 Golf Wagon looking for some insight,Hello! <lb><lb>Trying to find some information...,autos,VW
270,7df18v,Lexus,IS 250 Coolant Flush/Change,https://www.cars.com/articles/how-often-should...,autos,lexus
286,5tpve8,volt,Gen1 mpg w/ dead battery?,"Hi, new to this subreddit. I'm considering bu...",autos,chevrolet


# Cleaning Text Data
The texts are not well edited. Several quality problems need to be taken into account:
- **Salutations, signatures and addresses**: usually not informative.

- **Replies**: if the text contains replies that repeat the question, the duplicated question must be removed; otherwise it can bias the statistical analysis.
    
- **Special formatting and program code**: the text may contain special characters, HTML entities, Markdown tags, and so on. These must be removed before the analysis.


In [23]:
import re  # re: standard library for regular expressions

### Evaluating the ratio of suspicious characters

In [25]:
RE_SUSPICIOUS = re.compile(r'[&#<>{}\[\]\\]')

def impurity(text, min_len=10):
    """returns the share of suspicious characters in a text"""
    if text is None or len(text) < min_len:
        return 0
    else:
        return len(RE_SUSPICIOUS.findall(text))/len(text)
        

In [26]:
text=df.iloc[3]['text']
print(text)

https://www.cars.com/articles/how-often-should-i-change-engine-coolant-1420680853669/<lb><lb>I have a IS 250 AWD from 2006. About 73K miles on it. I've never touched the engine radiator coolant and can't find anything on when to change this in the book. It just says 'long life 100k Toyota coolant.' <lb><lb>Does anyone get this flushed or changed at ten years?? Do I wait until 100k? 


In [27]:
impurity(text)

0.02077922077922078

In [28]:
df['impurity']=df['text'].apply(impurity,min_len=10)
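A quick way to get a feel for the score is to try it on a few short strings. The sketch below repeats the definition of `impurity` so that it runs on its own; the sample strings and the names `samples` and `scores` are made up for illustration.

```python
import re

RE_SUSPICIOUS = re.compile(r'[&#<>{}\[\]\\]')

def impurity(text, min_len=10):
    """Share of suspicious characters; 0 for missing or very short texts."""
    if text is None or len(text) < min_len:
        return 0
    return len(RE_SUSPICIOUS.findall(text)) / len(text)

# Hypothetical sample strings
samples = {
    "plain": "Just a normal sentence.",   # no suspicious characters
    "markup": "<b>Hi</b> &amp; bye",      # 5 suspicious characters out of 19
    "short": "<lb>",                      # shorter than min_len, scored as 0
}
scores = {name: impurity(s) for name, s in samples.items()}
for name, score in scores.items():
    print(f"{name:<8}{score:.3f}")
```

The `min_len` threshold keeps very short texts from getting an extreme score because of a single stray character.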

### Removing noise with regular expressions

In [29]:
import html
def clean(text):
    # convert HTML escapes like &amp; to characters
    text = html.unescape(text)
    # tags like <tab>
    text=re.sub(r'<[^<>]*>',' ',text)
    # markdown URLs like [Some text](https://...)
    text=re.sub(r'\[([^\[\]]*)\]\([^\(\)]*\)', r'\1',text)
    # text or code in brackets like [0]
    text=re.sub(r'\[[^\[\]]*\]', ' ', text)
    # standalone sequences of specials, matches &# but not #cool
    text=re.sub(r'(?:^|\s)[&#<>{}\[\]+|\\:-]{1,}(?:\s|$)',' ',text)
    # standalone sequences of hyphens like --- or ==
    text=re.sub(r'(?:^|\s)[\-=\+]{2,}(?:\s|$)',' ',text)
    # sequences of white spaces
    text=re.sub(r'\s+', ' ', text)
    return text.strip()

In [30]:
clean_text = clean(text)

In [31]:
print(clean_text)

https://www.cars.com/articles/how-often-should-i-change-engine-coolant-1420680853669/ I have a IS 250 AWD from 2006. About 73K miles on it. I've never touched the engine radiator coolant and can't find anything on when to change this in the book. It just says 'long life 100k Toyota coolant.' Does anyone get this flushed or changed at ten years?? Do I wait until 100k?


In [32]:
impurity(clean_text)

0.0
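To see what each substitution contributes, the steps can be applied one at a time while printing the intermediate text. This is only a sketch: the patterns are written to do what the comments in `clean` describe, and `STEPS`, `clean_verbose` and the sample string are hypothetical names and data introduced for illustration.

```python
import html
import re

# (description, pattern, replacement), one entry per substitution step
STEPS = [
    ("tags", r'<[^<>]*>', ' '),
    ("markdown links", r'\[([^\[\]]*)\]\([^\(\)]*\)', r'\1'),
    ("bracketed text", r'\[[^\[\]]*\]', ' '),
    ("standalone specials", r'(?:^|\s)[&#<>{}\[\]+|\\:-]{1,}(?:\s|$)', ' '),
    ("hyphen runs", r'(?:^|\s)[\-=\+]{2,}(?:\s|$)', ' '),
    ("whitespace", r'\s+', ' '),
]

def clean_verbose(text):
    """Apply the cleaning steps one by one and print each intermediate result."""
    text = html.unescape(text)
    print(f"{'unescape':>20}: {text!r}")
    for name, pattern, repl in STEPS:
        text = re.sub(pattern, repl, text)
        print(f"{name:>20}: {text!r}")
    return text.strip()

# Made-up sample with an entity, a tag, a markdown link and a footnote marker
sample = "Check [this guide](https://example.com) &amp; the <lb>manual [1] --- thanks"
result = clean_verbose(sample)
print(result)
```

The order matters: markdown links are resolved before the generic bracket rule, otherwise the link text would be deleted together with its brackets.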

In [33]:
df['clean_text']=df['text'].map(clean)

In [34]:
df['impurity']=df['clean_text'].apply(impurity, min_len=20)

In [35]:
df[['clean_text','impurity']].sort_values(by='impurity',ascending=False).head(3)

Unnamed: 0,clean_text,impurity
356461,Split b/w 2 genesis options. Hyundai Genesis\ ...,0.039088
957625,"At the dealership, they offered an option for ...",0.026455
836076,"I am looking at four Caymans, all are in a sim...",0.024631


In [36]:

# Tokenization

In [36]:
import nltk
nltk.download('punkt')  # tokenizer models needed by word_tokenize

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ylepen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [37]:
print(text)

https://www.cars.com/articles/how-often-should-i-change-engine-coolant-1420680853669/<lb><lb>I have a IS 250 AWD from 2006. About 73K miles on it. I've never touched the engine radiator coolant and can't find anything on when to change this in the book. It just says 'long life 100k Toyota coolant.' <lb><lb>Does anyone get this flushed or changed at ten years?? Do I wait until 100k? 


In [38]:
tokens=nltk.tokenize.word_tokenize(text)

In [39]:
print(*tokens,sep='|')

https|:|//www.cars.com/articles/how-often-should-i-change-engine-coolant-1420680853669/|<|lb|>|<|lb|>|I|have|a|IS|250|AWD|from|2006|.|About|73K|miles|on|it|.|I|'ve|never|touched|the|engine|radiator|coolant|and|ca|n't|find|anything|on|when|to|change|this|in|the|book|.|It|just|says|'long|life|100k|Toyota|coolant|.|'|<|lb|>|<|lb|>|Does|anyone|get|this|flushed|or|changed|at|ten|years|?|?|Do|I|wait|until|100k|?
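The output above shows that tokenizing the raw text splits each `<lb>` marker into `<|lb|>`, so it is usually better to tokenize the cleaned text. For comparison, here is a rough regular-expression baseline. It is a sketch with an assumed pattern (`TOKEN_RE`) and a hypothetical helper (`simple_tokenize`), not the algorithm NLTK uses; unlike `word_tokenize`, it keeps contractions such as `I've` in one piece.

```python
import re

# A token is a run of word characters, optionally joined by internal hyphens
# or apostrophes; any other non-space character becomes its own token.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Return the list of tokens found by TOKEN_RE."""
    return TOKEN_RE.findall(text)

# Made-up sample sentence
sample = "I've never touched the long-life coolant. Do I wait until 100k?"
tokens = simple_tokenize(sample)
print(*tokens, sep='|')
```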


# spaCy

In [44]:
import spacy

## Linguistic Processing with spaCy

- spaCy: library for linguistic data processing
- spaCy provides an integrated pipeline for processing documents:
    
    1. a tokenizer (by default)
    2. a part-of-speech tagger  
    3. a dependency parser
    4. a named-entity recognizer
    
- The tokenizer is based on language-dependent rules => fast
- Components 2, 3 and 4 are based on pretrained neural models => they can take 10-20 times as long as tokenization

- The initial input is a text

- The final output is a **Doc** object

- The **Doc** object contains a list of **Tokens** objects

- Any range selection of tokens creates a **Span**
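The relationship between **Doc**, **Token** and **Span** can be checked without downloading a model. The sketch below assumes a blank English pipeline created with `spacy.blank("en")`, which contains only the tokenizer; the name `nlp_blank` and the sample sentence are chosen here for illustration.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required
nlp_blank = spacy.blank("en")
doc = nlp_blank("My best friend Ryan Peters likes to travel.")

token = doc[3]    # indexing a Doc gives a Token
span = doc[3:5]   # slicing a Doc gives a Span

print(type(doc).__name__, len(doc))
print(type(token).__name__, token.text)
print(type(span).__name__, span.text)
```

A Span is a view on the Doc, so slicing does not copy the underlying tokens.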

### Instantiating the pipeline

- We need to load a model file
- 'en_core_web_sm': model file for English
- 'fr_core_news_sm': model file for French

List of language models for spaCy: https://spacy.io/usage/models

In [None]:
from spacy.cli import download
print(download('en_core_web_sm'))

The variable for the language object is usually called nlp

In [48]:
nlp=spacy.load('en_core_web_sm')

Components of the pipeline

In [49]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1f4d1526ab0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1f4b19cebd0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1f4bc3069d0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1f4d1666850>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1f4d161a190>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1f4bc3068f0>)]

We can load the pipeline with selected components disabled

In [54]:
nlp_2=spacy.load('en_core_web_sm', disable=["parser","ner"])

###  Processing Text 

In [78]:
text='My best friend Ryan Peters like to travel in dangerous countries.'

In [79]:
doc=nlp(text)

In [80]:
print(doc)

My best friend Ryan Peters like to travel in dangerous countries.


In [81]:
for token in doc:
    print(token,end='|')

My|best|friend|Ryan|Peters|like|to|travel|in|dangerous|countries|.|

If we want to use just the tokenizer:

In [83]:
nlp.make_doc(text)

My best friend Ryan Peters like to travel in dangerous countries.

### Attributes of Token:

    - token.is_punct : Is the token punctuation?
    - token.is_alpha : Does the token consist of alphabetic characters?
    - token.like_email : Does the token resemble an email address?
    - token.like_url : Does the token resemble a URL?
    - token.is_stop : Is the token part of a “stop list”?
    - token.lemma_ : Base form of the token, with no inflectional suffixes.
    - token.pos_ : Coarse-grained part-of-speech tag https://universaldependencies.org/u/pos/
            
            
See https://spacy.io/api/token for the list of all attributes
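The purely lexical attributes (`is_alpha`, `is_punct`, `like_email`, `like_url`, `is_stop`) do not need a trained model, so they can be tried on a blank pipeline. The sketch below uses a made-up sentence and the hypothetical names `nlp_blank`, `emails` and `urls`; `lemma_` and `pos_` are left out because they require the trained components.

```python
import spacy

nlp_blank = spacy.blank("en")   # tokenizer and lexical attributes only
doc = nlp_blank.make_doc("Mail pete@example.com or visit https://spacy.io today!")

# One line per token with its lexical flags
for t in doc:
    print(f"{t.text:<20} alpha={t.is_alpha!s:<5} punct={t.is_punct!s:<5} "
          f"email={t.like_email!s:<5} url={t.like_url!s:<5} stop={t.is_stop}")

# The flags make it easy to pull out specific kinds of tokens
emails = [t.text for t in doc if t.like_email]
urls = [t.text for t in doc if t.like_url]
print(emails, urls)
```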

In [84]:
for token in doc:
    print(token,token.is_punct)

My False
best False
friend False
Ryan False
Peters False
like False
to False
travel False
in False
dangerous False
countries False
. True


In [85]:
for token in doc:
    print(token,token.is_stop)

My True
best False
friend False
Ryan False
Peters False
like False
to True
travel False
in True
dangerous False
countries False
. False


In [86]:
for token in doc:
    print(token,token.is_alpha)

My True
best True
friend True
Ryan True
Peters True
like True
to True
travel True
in True
dangerous True
countries True
. False


In [87]:
for token in doc:
    print(token,token.lemma_)

My my
best good
friend friend
Ryan Ryan
Peters Peters
like like
to to
travel travel
in in
dangerous dangerous
countries country
. .


## Customizing Tokenization

Sometimes it is necessary to adjust the tokenizer so that tokens containing a hyphen, an underscore or a hash sign (#) are kept together. By default, spaCy splits them:

In [75]:
text = "@Pete: can't choose low-carb # food #eat-smart. _url_ ; -) "
doc = nlp.make_doc(text)
for token in doc:
    print(token, end="|")

@Pete|:|ca|n't|choose|low|-|carb|#|food|#|eat|-|smart|.|_|url|_|;|-|)|
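One possible way to keep such tokens together is to rebuild the tokenizer from spaCy's default rules, leaving out the ones that split on these characters. The sketch below is an assumption about how this could be done, not code from this notebook: `custom_tokenizer` and `nlp_custom` are hypothetical names, and the exact result may vary with the spaCy version because the default rule lists change between releases.

```python
import re
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import (compile_prefix_regex, compile_infix_regex,
                        compile_suffix_regex)

def custom_tokenizer(nlp):
    """Build a tokenizer that does not split on hyphen, underscore or hash sign."""
    # Drop the default prefix/suffix patterns for the characters we want to keep
    prefixes = [p for p in nlp.Defaults.prefixes if p not in ['-', '_', '#']]
    suffixes = [p for p in nlp.Defaults.suffixes if p not in ['_']]
    # Drop every infix pattern that would split a word at an inner hyphen
    infixes = [p for p in nlp.Defaults.infixes if not re.search(p, 'xx-xx')]

    return Tokenizer(
        vocab=nlp.vocab,
        rules=nlp.Defaults.tokenizer_exceptions,
        prefix_search=compile_prefix_regex(prefixes).search,
        suffix_search=compile_suffix_regex(suffixes).search,
        infix_finditer=compile_infix_regex(infixes).finditer,
        token_match=nlp.Defaults.token_match,
    )

nlp_custom = spacy.blank("en")   # tokenizer-only pipeline, no model needed
nlp_custom.tokenizer = custom_tokenizer(nlp_custom)

text = "@Pete: can't choose low-carb # food #eat-smart. _url_ ; -) "
doc = nlp_custom.make_doc(text)
tokens = [t.text for t in doc]
print(*tokens, sep="|")
```

Filtering the default lists, rather than writing new rules from scratch, keeps the rest of spaCy's tokenization behaviour (contractions, punctuation, URLs) as close to the default as possible.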

## Working with stop words

- spaCy uses language-specific stop word lists to set the is_stop property for each token
- Filtering stop words (and punctuation tokens) is easy

In [None]:
text = "Dear Ryan, we need to sit down and talk. Regards, Pete"
doc = nlp(text)
non_stop = [t for t in doc if not t.is_stop and not t.is_punct]
print(non_stop)