# Tutorial on Introduction to Python for NLP

## Outline
We will first set up Python 3.8 using Anaconda, create a virtual environment, and then cover the basics of file operations used in text processing.

Python is an interpreted programming language: the Python interpreter executes a program statement by statement. This is quite different from compiled languages such as C and C++, where the whole program is translated to machine code before it runs.
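A small sketch of what statement-by-statement execution looks like in practice: the first two statements run and take effect before the interpreter reaches the faulty third one. The names `executed` and `undefined_name` are made up for this illustration.

```python
executed = []

try:
    executed.append('statement 1')   # runs
    executed.append('statement 2')   # runs
    undefined_name                   # NameError is raised only when this line is reached
    executed.append('statement 3')   # never runs
except NameError as error:
    print('Stopped with:', error)

print(executed)   # ['statement 1', 'statement 2']
```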

## Setting up Python

1. To download Anaconda (Individual Edition), please visit their [website](https://www.anaconda.com/products/individual). Installation instructions for Windows, Linux and Mac are available [here](https://docs.anaconda.com/anaconda/install/).

For Windows, open 'Anaconda Prompt' or the command line.

For Linux and Mac, please check that **conda** is accessible from your terminal.

2. Set up a virtual environment with conda <br>
<code>conda create -n nlp2021 python=3.8 </code>

To activate the environment, please type: <br>
<code>conda activate nlp2021 </code>
    
To deactivate the environment, please type: <br>
<code>conda deactivate</code>


3. Install the most commonly used Python packages <br>
<code>conda install jupyter pandas nltk numpy scikit-learn matplotlib</code>

In [1]:
print('Hello World') # print a single line statement

Hello World


In [2]:
# print multi-line statements
print('''Hello
World
''')

Hello
World



In [3]:
1 + 12

13

In [4]:
5 * 6

30

In [5]:
num1 = 1 + 12 # = is the assignment operator

In [6]:
num1 < 5 # < is a comparison operator; the expression evaluates to True or False

False

In [7]:
num1 == 13 # equality operator

True

In [8]:
type(num1) # the type of a variable is determined automatically

int
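Because the type is attached to the value rather than to the name, the same variable can be rebound to values of different types. A minimal sketch, with `value` and the `*_type` names chosen only for illustration:

```python
value = 1 + 12
first_type = type(value)     # int

value = 13 / 2               # true division always produces a float
second_type = type(value)    # float

value = 'thirteen'
third_type = type(value)     # str

print(first_type, second_type, third_type)
```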

## Conditional Statements

In [46]:
num1 = 20

if num1 < 10:
    print(num1, 'is less than 10') # block of code that is executed when if condition is true
    print('Condition 1 Line 1')
    print('Condition 1 Line 2')
elif num1 < 15:
    print(num1, 'is at least 10 but less than 15')
else:
    print(num1, 'is 15 or more')

20 is 15 or more
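The same if/elif/else chain can be wrapped in a function and tried on several inputs. This is a sketch; `describe` is a hypothetical helper, and the boundary values 10 and 15 show which branch an exact match falls into.

```python
def describe(number):
    """Return a description of `number` using an if/elif/else chain."""
    if number < 10:
        return 'less than 10'
    elif number < 15:          # reached only when number >= 10
        return 'at least 10 but less than 15'
    else:                      # reached only when number >= 15
        return '15 or more'

for sample in (3, 10, 15, 20):
    print(sample, 'is', describe(sample))
```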


## Loops

In [10]:
for index in range(10):  # start - default value is 0, step default value is 1
    print(index)

0
1
2
3
4
5
6
7
8
9


In [11]:
for index in range(10,0,-1):
    print(index)

10
9
8
7
6
5
4
3
2
1
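The two loops above use `range(stop)` and `range(start, stop, step)`. Converting a range to a list is a quick way to check which values it produces; note that `stop` itself is never included. The variable names below are illustrative.

```python
only_stop = list(range(5))            # start defaults to 0, step to 1
start_stop = list(range(2, 8))        # stop value 8 is excluded
with_step = list(range(0, 10, 3))     # every third value
countdown = list(range(10, 0, -1))    # negative step counts down, 0 is excluded
empty = list(range(5, 5))             # start == stop gives no values

print(only_stop)    # [0, 1, 2, 3, 4]
print(start_stop)   # [2, 3, 4, 5, 6, 7]
print(with_step)    # [0, 3, 6, 9]
print(countdown)    # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
print(empty)        # []
```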


In [12]:
for index in range(10):
    if index % 2 == 0:
        print(index)

0
2
4
6
8


In [13]:
for index in range(10):
    if index == 4:
        continue
        
    if index % 2 == 0:
        print(index)
        
    if index == 7:
        break

0
2
6
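To see why 4 and 8 are missing from the output, the sketch below runs the same loop but records every index the loop visits (`visited`) and every index that would be printed (`printed`). Both list names are assumptions made for this illustration.

```python
visited = []
printed = []

for index in range(10):
    visited.append(index)

    if index == 4:
        continue            # skip the rest of this iteration, so 4 is never printed

    if index % 2 == 0:
        printed.append(index)

    if index == 7:
        break               # leave the loop entirely, so 8 and 9 are never visited

print('visited:', visited)  # visited: [0, 1, 2, 3, 4, 5, 6, 7]
print('printed:', printed)  # printed: [0, 2, 6]
```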


## List and dictionary

list: may contain duplicate elements <br>
set: no duplicate elements <br>
dictionary: stored in a key-value format
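A short sketch contrasting the three containers on the same data, here a hand-written list of words. The variable names are illustrative; counting words with a dictionary is a common pattern in text processing.

```python
words = ['heart', 'failure', 'heart', 'agents', 'failure']

unique_words = set(words)      # duplicates are dropped

word_counts = {}               # key: word, value: number of occurrences
for word in words:
    word_counts[word] = word_counts.get(word, 0) + 1

print(len(words))              # 5
print(len(unique_words))       # 3
print(word_counts)             # {'heart': 2, 'failure': 2, 'agents': 1}
```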

In [14]:
list1 = [1,2,3,4,5,6,7,8,9,10]
list2 = [1]*5 + [10]*5
list3= list1 + list2
print('List1:', list1, 'List 2: ', list2, 'List 3: ', list3)

List1: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] List 2:  [1, 1, 1, 1, 1, 10, 10, 10, 10, 10] List 3:  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 1, 1, 1, 1, 10, 10, 10, 10, 10]


In [15]:
len(list3)

20

In [16]:
type(list3)

list

In [17]:
print(list3[10])

1


In [18]:
dict1= {'A': 'apple', 'B': 'ball', 'C': 'cat'} # this is an example of a Python dictionary

In [19]:
dict1['C']

'cat'

In [20]:
dict1.keys()

dict_keys(['A', 'B', 'C'])

In [21]:
for key in dict1.keys():
    print(dict1[key])

apple
ball
cat
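Two other dictionary methods are often handy here: `items()` yields keys and values together, and `get()` returns a fallback instead of raising `KeyError` for a missing key. A minimal sketch using the same dictionary; `pairs` and `missing` are illustrative names.

```python
dict1 = {'A': 'apple', 'B': 'ball', 'C': 'cat'}

pairs = []
for key, value in dict1.items():      # iterate over key-value pairs
    pairs.append((key, value))
    print(key, '->', value)

missing = dict1.get('D', 'not found') # 'D' is not a key, so the fallback is returned
print(missing)                        # not found
print('D' in dict1)                   # False
```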


## Reading and writing files

.csv - comma separated values <br>
.tsv - tab separated values <br>
.txt - simple text file

In [22]:
with open('data/introduction.txt', 'r') as f: #relative path
    print('Original text file: ', f.read())

Original text file:  Hello World
You are now reading the second line of the file
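The cell above only reads a file. A minimal sketch of the writing side follows, using mode `'w'` and then reading the file back line by line. The temporary directory and the file name `introduction_copy.txt` are assumptions, chosen so the example does not depend on the `data/` folder.

```python
import os
import tempfile

# A temporary directory stands in for the tutorial's data/ folder
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'introduction_copy.txt')

    # 'w' creates the file, or overwrites it if it already exists
    with open(path, 'w', encoding='utf-8') as f:
        f.write('Hello World\n')
        f.write('You are now reading the second line of the file\n')

    # Iterating over the file object yields one line at a time
    with open(path, 'r', encoding='utf-8') as f:
        lines = [line.rstrip('\n') for line in f]

print(lines)
```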



## Importing packages

Most tasks are easier with an existing package, for example using the pandas package to read a .tsv file. <br>

Tips for using the functionality of Python packages
1. Search Google with keywords such as 'pandas read csv'
2. Read the corresponding official documentation, which lists the parameters that need to be passed, their default values, and the return type


In [23]:
import pandas as pd

In [24]:
# Download dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/00461/drugLib_raw.zip
# Use this in terminal: wget https://archive.ics.uci.edu/ml/machine-learning-databases/00461/drugLib_raw.zip
# followed by unzipping the compressed file: unzip drugLib_raw.zip

data_readpath = 'data/drugLib_raw/drugLibTrain_raw.tsv' # Relative path

In [25]:
train_data = pd.read_csv(data_readpath, sep='\t')
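To see what `sep='\t'` means, here is a sketch that parses tab-separated text with only the standard-library `csv` module. It is an alternative illustration, not what pandas does internally. The in-memory sample reuses two drug names and ratings from this dataset; `sample_tsv` and `rows` are illustrative names.

```python
import csv
import io

# Two columns separated by a tab character, one record per line
sample_tsv = 'urlDrugName\trating\nenalapril\t4\northo-tri-cyclen\t1\n'

# delimiter='\t' plays the same role as sep='\t' in pd.read_csv
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter='\t')
rows = list(reader)

for row in rows:
    # The csv module does not convert types: every value is a string
    print(row['urlDrugName'], int(row['rating']))
```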

In [26]:
train_data.head()

| | Unnamed: 0 | urlDrugName | rating | effectiveness | sideEffects | condition | benefitsReview | sideEffectsReview | commentsReview |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2202 | enalapril | 4 | Highly Effective | Mild Side Effects | management of congestive heart failure | slowed the progression of left ventricular dys... | cough, hypotension , proteinuria, impotence , ... | monitor blood pressure , weight and asses for ... |
| 1 | 3117 | ortho-tri-cyclen | 1 | Highly Effective | Severe Side Effects | birth prevention | Although this type of birth control has more c... | Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon... | I Hate This Birth Control, I Would Not Suggest... |
| 2 | 1146 | ponstel | 10 | Highly Effective | No Side Effects | menstrual cramps | I was used to having cramps so badly that they... | Heavier bleeding and clotting than normal. | I took 2 pills at the onset of my menstrual cr... |
| 3 | 3947 | prilosec | 3 | Marginally Effective | Mild Side Effects | acid reflux | The acid reflux went away for a few months aft... | Constipation, dry mouth and some mild dizzines... | I was given Prilosec prescription at a dose of... |
| 4 | 1951 | lyrica | 2 | Marginally Effective | Severe Side Effects | fibromyalgia | I think that the Lyrica was starting to help w... | I felt extremely drugged and dopey. Could not... | See above |


In [27]:
print(train_data.iloc[0,0], train_data.iloc[0,1])

2202 enalapril


In [28]:
train_data.columns

Index(['Unnamed: 0', 'urlDrugName', 'rating', 'effectiveness', 'sideEffects',
       'condition', 'benefitsReview', 'sideEffectsReview', 'commentsReview'],
      dtype='object')

In [29]:
for row_index in range(train_data.shape[0]):
    print(train_data.iloc[row_index, 1])

enalapril
ortho-tri-cyclen
ponstel
prilosec
lyrica
propecia
vyvanse
elavil
xanax
claritin
flagyl
dextroamphetamine
sarafem
latisse
aldara
effexor-xr
neurontin
omnicef
klonopin
dovonex
protopic
effexor
sotret
retin-a-micro
lamotrigine
rebif
symbicort
lamictal
lyrica
doxycycline
vyvanse
actonel
provigil
ambien
wellbutrin
propecia
nortriptyline
imitrex
ativan
doxycycline
prozac
topamax
levitra
oxycodone
lamictal
oxycontin
vicodin
accutane
zocor
aldara
ortho-tri-cyclen
minocycline
estrace
meridia
sotret
prevacid
cosopt
renova
depakote
tekturna
renova
zegerid
sular
crestor
metformin
celexa
lexapro
naproxen
levoxyl
synthroid
spironolactone
minocycline
oracea
paxil
zantac
fosamax
prevacid
tirosint
cymbalta
ambien-cr
angeliq
imitrex
prempro
lexapro
latisse
wellbutrin-xl
biaxin
zantac
spironolactone
lipitor
omnicef
tazorac
alendronate
vyvanse
claripel-cream
valtrex
femring
neurontin
soma
lipitor
tylenol
ultram
chantix
ziana
vivelle-dot
effexor-xr
ultram-er
lotrel
viagra
seasonale
wellbutrin-sr


lamotrigine
lamictal
tramadol
xanax
metrogel
inderal
plavix
augmentin
ultram
naproxen
metformin
sulfasalazine
protonix
tamiflu
celebrex
yasmin
hyoscyamine
sarafem
lexapro
ortho-tri-cyclen
accutane
propecia
flexeril
tri-luma
imitrex
botox
zomig
emsam
vyvanse
flexeril
anafranil
ambien
alprazolam
vivelle-dot
climara
botox
neurontin
evista
chantix
femhrt
requip
prozac
zoloft
ultram
xanax
claritin
benadryl
skelaxin
differin
zoloft
restoril
diflucan
neurontin
adderall-xr
retin-a
zofran
paxil
meridia
acyclovir
abilify
minocycline
omnicef
remeron
prozac
prinivil
nexium
elidel
omnicef
proair-hfa
flexeril
differin
prevacid
vyvanse
effexor-xr
wellbutrin-xl
coreg
nexium
kenalog
zithromax
adderall
niaspan
protonix
elavil
metformin
tramadol
premarin
strattera
effexor
tramadol
retin-a
retin-a
citalopram
symbicort
climara
alprazolam
fluconazole
evista
seroquel
lorazepam
retin-a
flagyl
plavix
percocet
clonazepam
ribavirin
lexapro
humira
wellbutrin
lyrica
avita
buspar
wellbutrin-xl
retin-a-micro
restasi

## Using nltk package for text processing
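We have already installed nltk.

If nltk.download() raises an error (typically an SSL certificate verification failure), the workaround below, as described on Stack Overflow (a popular site for debugging error messages), can be run first. It turns off HTTPS certificate verification for the session, so use it only when the download fails.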

In [30]:
import ssl
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

In [31]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\roysoumya\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [32]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer


In [33]:
row_comment1= train_data.iloc[0, 6] # row 0, column 6 (benefitsReview)
print(row_comment1)

slowed the progression of left ventricular dysfunction into overt heart failure 
alone or with other agents in the managment of hypertension 
mangagement of congestive heart failur


In [34]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\roysoumya\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [35]:
words= word_tokenize(row_comment1)
print(words)

['slowed', 'the', 'progression', 'of', 'left', 'ventricular', 'dysfunction', 'into', 'overt', 'heart', 'failure', 'alone', 'or', 'with', 'other', 'agents', 'in', 'the', 'managment', 'of', 'hypertension', 'mangagement', 'of', 'congestive', 'heart', 'failur']


In [36]:
sents= sent_tokenize(row_comment1)
print(sents)

['slowed the progression of left ventricular dysfunction into overt heart failure \r\r\nalone or with other agents in the managment of hypertension \r\r\nmangagement of congestive heart failur']


In [37]:
row_comment1_mod= row_comment1.replace('\r\r\n', '. ')
print(row_comment1_mod)

slowed the progression of left ventricular dysfunction into overt heart failure . alone or with other agents in the managment of hypertension . mangagement of congestive heart failur


In [38]:
print(sent_tokenize(row_comment1_mod))

['slowed the progression of left ventricular dysfunction into overt heart failure .', 'alone or with other agents in the managment of hypertension .', 'mangagement of congestive heart failur']
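The `replace('\r\r\n', '. ')` call works for this review, but other rows may use different line endings. A slightly more general sketch, assuming any run of carriage returns and newlines marks a sentence boundary, uses a regular expression. The pattern and the name `normalized` are assumptions for illustration; the output can then be passed to `sent_tokenize` as before.

```python
import re

raw = ('slowed the progression of left ventricular dysfunction into overt heart failure \r\r\n'
       'alone or with other agents in the managment of hypertension \r\r\n'
       'mangagement of congestive heart failur')

# Replace any run of \r / \n (plus surrounding whitespace) with '. '
normalized = re.sub(r'\s*[\r\n]+\s*', '. ', raw)

print(normalized)
```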


## Removing stopwords and stemming

In [39]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\roysoumya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [40]:
from nltk.corpus import stopwords

In [41]:
stopwords1 = set(stopwords.words('english'))
print(stopwords1)

{'in', 'through', 'again', "you've", 'of', 'too', 'its', "mustn't", "it's", 'some', 'but', 'now', 'were', 'itself', 're', 'on', 'herself', 'wouldn', 'don', 'what', 'most', 'against', 'more', 'have', 'not', "isn't", 'my', "should've", 'it', 'further', 'with', 'he', 'was', "haven't", 'yourselves', 'few', 'him', 'such', "needn't", 'and', 'as', 'you', 'isn', 'up', 'yourself', "won't", 'do', 'both', 'the', 'until', 'while', 'between', 'mustn', 'ain', 'when', 'about', 'into', 'out', 'under', 'their', 'once', 'mightn', 'during', 'does', 'am', 'having', "didn't", 'be', 'so', 'hadn', 't', 'ma', 'been', 'down', "shouldn't", 'we', 'which', 'here', "aren't", 'themselves', 'any', 'by', 'needn', 'if', 'how', 'why', 'those', "don't", "wasn't", 'd', 'his', 'didn', "you'll", 'above', 'all', "wouldn't", 'no', 'our', 'that', 'her', 'she', 'm', 'yours', 'where', 'should', 'myself', 'whom', 'wasn', 'shan', 'theirs', 'hasn', 'ourselves', 'shouldn', 'then', 'own', "doesn't", 'these', 'has', 'same', 'after', 

In [42]:
clean_text = ' '.join([word for word in word_tokenize(row_comment1) if word not in stopwords1 ]) # List comprehension
print('Before: \n', row_comment1, '\nAfter: \n', clean_text)

Before: 
 slowed the progression of left ventricular dysfunction into overt heart failure 
alone or with other agents in the managment of hypertension 
mangagement of congestive heart failur 
After: 
 slowed progression left ventricular dysfunction overt heart failure alone agents managment hypertension mangagement congestive heart failur
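NLTK's English stopword list is all lowercase, so a capitalised token such as 'The' would slip through the filter above. A minimal sketch of a case-insensitive variant follows; the tiny `stop_set` and the `tokens` list are hand-written stand-ins so the example runs without NLTK.

```python
# Tiny stand-in for set(stopwords.words('english')); an assumption for illustration
stop_set = {'the', 'of', 'into', 'or', 'with', 'other', 'in'}

tokens = ['The', 'progression', 'of', 'heart', 'failure']

# Comparing tokens as-is keeps 'The', because only 'the' is in the set
case_sensitive = [token for token in tokens if token not in stop_set]

# Lowercasing each token before the lookup removes it
case_insensitive = [token for token in tokens if token.lower() not in stop_set]

print(case_sensitive)     # ['The', 'progression', 'heart', 'failure']
print(case_insensitive)   # ['progression', 'heart', 'failure']
```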


In [43]:
stemmer= PorterStemmer()

In [44]:
stemmed_text = ' '.join([stemmer.stem(word) for word in word_tokenize(row_comment1)]) # List comprehension
print('Before: \n', row_comment1, '\nAfter: \n', stemmed_text)

Before: 
 slowed the progression of left ventricular dysfunction into overt heart failure 
alone or with other agents in the managment of hypertension 
mangagement of congestive heart failur 
After: 
 slow the progress of left ventricular dysfunct into overt heart failur alon or with other agent in the manag of hypertens mangag of congest heart failur
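Stems such as 'dysfunct' and 'hypertens' are not dictionary words because a stemmer strips suffixes by rule rather than looking words up. The toy function below illustrates that idea with a handful of suffixes. It is a deliberately simplified sketch and not the Porter algorithm; `toy_stem` and its suffix list are made up for this example.

```python
def toy_stem(word, suffixes=('ing', 'ion', 'ed', 's')):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

toy_stems = [toy_stem(word) for word in ['slowed', 'progression', 'agents', 'heart']]
print(toy_stems)   # ['slow', 'progress', 'agent', 'heart']
```

For these four words the toy rules happen to agree with the PorterStemmer output above; the real algorithm applies many more rules and conditions.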


## References
1. Automate the Boring Stuff With Python - https://automatetheboringstuff.com
2. NLP Course Interactive Scribes - https://github.com/krishnamrith12/NotebooksNLP