
In [2]:
def describe(x):
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    print("Values: \n{}".format(x))

In [3]:
import torch
import numpy as np
npy = np.random.rand(2, 3)
describe(torch.from_numpy(npy))

Type: torch.DoubleTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[0.2998, 0.7225, 0.8179],
        [0.6285, 0.3550, 0.3808]], dtype=torch.float64)


In [4]:
x = torch.FloatTensor([[1, 2, 3],  
                      [4, 5, 6]])
describe(x)

Type: torch.FloatTensor
Shape/size: torch.Size([2, 3])
Values: 
tensor([[1., 2., 3.],
        [4., 5., 6.]])


In [5]:
describe(torch.cat([x, x], dim=0))

Type: torch.FloatTensor
Shape/size: torch.Size([4, 3])
Values: 
tensor([[1., 2., 3.],
        [4., 5., 6.],
        [1., 2., 3.],
        [4., 5., 6.]])


In [6]:
describe(torch.cat([x, x], dim=1))

Type: torch.FloatTensor
Shape/size: torch.Size([2, 6])
Values: 
tensor([[1., 2., 3., 1., 2., 3.],
        [4., 5., 6., 4., 5., 6.]])


In [1]:
import torch
print(torch.cuda.is_available())

# Edit > Notebook settings
## The runtime can be switched to CUDA (GPU) there

True


In [2]:
# preferred method: device agnostic tensor instantiation
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda
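
Once `device` is chosen, tensors can be created on it directly or moved to it afterwards. The following is a minimal sketch of that pattern; the tensor names `x`, `y`, and `z` are illustrative and not part of the original notebook.

```python
import torch

# Pick the GPU when one is present, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a tensor directly on the chosen device ...
x = torch.rand(3, 3, device=device)

# ... or move an existing CPU tensor there with .to(device).
y = torch.ones(3, 3).to(device)

# Both operands live on the same device, so the addition is valid.
z = x + y
print(z.device.type)
```

Keeping every tensor on the same `device` avoids the runtime error PyTorch raises when an operation mixes CPU and GPU tensors.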


In [7]:
# Exercise

## 1.
a = torch.rand(3, 3)
a.unsqueeze(0)

tensor([[[0.3123, 0.0342, 0.7939],
         [0.6343, 0.3780, 0.6126],
         [0.7580, 0.7069, 0.9784]]])

In [8]:
## 2.
a.squeeze(0)

tensor([[0.3123, 0.0342, 0.7939],
        [0.6343, 0.3780, 0.6126],
        [0.7580, 0.7069, 0.9784]])
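
Note that `unsqueeze` returns a new tensor rather than modifying `a`, so in the two cells above `a.squeeze(0)` runs on the original 3×3 tensor and leaves it unchanged. A small sketch that makes the shapes explicit (the names `b` and `c` are illustrative):

```python
import torch

a = torch.rand(3, 3)

# unsqueeze returns a new tensor; `a` itself keeps shape (3, 3).
b = a.unsqueeze(0)
print(b.shape)             # torch.Size([1, 3, 3])

# squeeze(0) removes dimension 0 only if its size is 1.
c = b.squeeze(0)
print(c.shape)             # torch.Size([3, 3])

# On `a`, dimension 0 has size 3, so squeeze(0) leaves the shape unchanged.
print(a.squeeze(0).shape)  # torch.Size([3, 3])
```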

In [6]:
## 3.
3 + torch.rand(5, 3) * (7 - 3)

tensor([[6.7775, 4.5199, 3.9935],
        [3.4617, 4.3688, 3.7699],
        [6.6716, 4.7270, 5.0996],
        [4.0167, 5.8897, 3.7229],
        [6.8974, 6.9758, 3.8514]])
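
The expression above works because `torch.rand` samples from [0, 1), so scaling by `7 - 3` and shifting by `3` maps the samples onto the interval from 3 to 7. A short sketch that checks this, with `low` and `high` as illustrative names, and `uniform_` shown as one possible in-place alternative:

```python
import torch

low, high = 3.0, 7.0

# torch.rand samples from [0, 1); scaling and shifting maps it onto [low, high].
u = low + torch.rand(5, 3) * (high - low)
print(u.min().item() >= low, u.max().item() <= high)

# An alternative: fill an uninitialized tensor in place from the same range.
v = torch.empty(5, 3).uniform_(low, high)
print(v.shape)  # torch.Size([5, 3])
```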

In [None]:
## 4.
a = torch.rand(3, 3)
a.normal_()

## 5.
a = torch.rand(3, 1)
a.expand(3, 4)
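
The cell above was not executed, so here is a sketch of what exercises 4 and 5 do, with the resulting shapes spelled out. The names `col` and `wide` are illustrative.

```python
import torch

# normal_() overwrites the tensor in place with samples from N(0, 1).
a = torch.rand(3, 3)
a.normal_()
print(a.shape)     # torch.Size([3, 3])

# expand broadcasts a size-1 dimension to the requested size without copying data.
col = torch.rand(3, 1)
wide = col.expand(3, 4)
print(wide.shape)  # torch.Size([3, 4])

# Every column of `wide` repeats the original column.
print(torch.equal(wide[:, 0], wide[:, 3]))  # True
```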

# Chapter 2. A Quick Tour of Traditional NLP

## Corpora, Tokens, and Types

*   A corpus usually contains raw text (in ASCII or UTF-8) and any metadata associated with it.
*   The text along with its metadata is called an instance or data point.
*   The process of breaking a text down into tokens is called tokenization.
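
To make the distinction between tokens and types concrete: types are the unique tokens in a corpus. The sketch below uses a tiny made-up corpus and plain whitespace splitting as a stand-in for a real tokenizer.

```python
# A tiny hypothetical corpus; whitespace splitting stands in for a real tokenizer.
corpus = ["the green witch", "mary slapped the green witch"]

# Tokens: every item produced by tokenization, repeats included.
tokens = [tok for text in corpus for tok in text.split()]

# Types: the set of distinct tokens.
types = set(tokens)

print(len(tokens))  # 8
print(len(types))   # 5
```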

In [10]:
import spacy
nlp = spacy.load('en')  # spaCy 2.x shortcut; spaCy 3+ needs the full model name, e.g. 'en_core_web_sm'
text = "Mary, don’t slap the green witch"
print([str(token) for token in nlp(text.lower())])

['mary', ',', 'do', 'n’t', 'slap', 'the', 'green', 'witch']


In [14]:
from nltk.tokenize import TweetTokenizer
tweet = "Snow White and the Seven Degrees #MakeAMovieCold@midnight:-)"
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(tweet.lower()))

['snow', 'white', 'and', 'the', 'seven', 'degrees', '#makeamoviecold', '@midnight', ':-)']


## Unigrams, Bigrams, Trigrams, …, N-grams



*   N-grams are fixed-length (n) consecutive token sequences occurring in the text.
*   A bigram has two tokens, a unigram one.



### Example 2-2. Generating n-grams from text

In [16]:
def n_grams(text, n):
    '''
    takes tokens or text, returns a list of n-grams
    '''
    return [text[i:i+n] for i in range(len(text)-n+1)]

cleaned = ['mary', ',', "n't", 'slap', 'green', 'witch', '.']
print(n_grams(cleaned, 3))

[['mary', ',', "n't"], [',', "n't", 'slap'], ["n't", 'slap', 'green'], ['slap', 'green', 'witch'], ['green', 'witch', '.']]
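
Because `n_grams` relies only on slicing, the same function can also produce bigrams from a token list or character n-grams from a string. A brief usage sketch (the inputs are illustrative):

```python
def n_grams(text, n):
    """Return every run of n consecutive items in text (a token list or a string)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

tokens = ['mary', 'slapped', 'the', 'green', 'witch']

# Bigrams over a list of tokens.
print(n_grams(tokens, 2))
# [['mary', 'slapped'], ['slapped', 'the'], ['the', 'green'], ['green', 'witch']]

# Slicing a string yields character n-grams.
print(n_grams("witch", 3))
# ['wit', 'itc', 'tch']
```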


## Lemmas and Stems

*   Lemmas are root forms of words.
*   For example, the verb fly can be inflected into many different words (flew, flies, flown, flying, and so on), and fly is the lemma of all of them.



### Example 2-3. Lemmatization: reducing words to their root forms

In [17]:
# token.lemma_

import spacy
nlp = spacy.load('en')
doc = nlp(u"he was running late")
for token in doc:
    print('{} --> {}'.format(token, token.lemma_))

he --> -PRON-
was --> be
running --> run
late --> late




*   Stemming is the poor man's lemmatization.
*   It uses handcrafted rules to strip the endings of words and reduce them to a common form called a stem.
*   The Porter and Snowball stemmers are two well-known examples.
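
As a rough illustration, both stemmers are available in NLTK. The sketch below assumes NLTK is installed (it is already used for `TweetTokenizer` in this notebook); the word list is made up for the example.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Print each word next to its Porter and Snowball stems.
for word in ["running", "flies", "geese"]:
    print(word, porter.stem(word), snowball.stem(word))
```

Unlike lemmas, stems do not have to be dictionary words, because the rules only strip or rewrite suffixes.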



## Categorizing Sentences and Documents

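*   Categorizing or classifying documents is probably one of the earliest applications of NLP.
*   Term frequency (TF) and term frequency-inverse document frequency (TF-IDF) representations are useful for this kind of task.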
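
A minimal pure-Python sketch of TF and TF-IDF follows, assuming raw counts for TF and the unsmoothed definition IDF(w) = log(N / n_w). The toy corpus and the helper names `tf` and `idf` are illustrative; libraries such as scikit-learn apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized by whitespace.
docs = [
    "time flies flies like an arrow".split(),
    "fruit flies like a banana".split(),
]

def tf(doc):
    """Raw term frequency: how often each token occurs in one document."""
    return Counter(doc)

def idf(docs):
    """IDF(w) = log(N / n_w), where n_w is the number of documents containing w."""
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    return {w: math.log(n_docs / n) for w, n in doc_freq.items()}

idf_scores = idf(docs)

# TF-IDF is the product of the two scores, computed per document.
tfidf = [{w: count * idf_scores[w] for w, count in tf(doc).items()} for doc in docs]

print(tfidf[0]["flies"])  # 0.0, because "flies" occurs in every document
print(tfidf[0]["time"])   # log(2), because "time" occurs in only one document
```

Words that appear in every document get a score of zero, which is how TF-IDF downweights terms that do not help tell documents apart.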




## Categorizing Words: POS Tagging

*   A common example of categorizing words is part-of-speech (POS) tagging.

### Example 2-4. Part-of-speech tagging

In [18]:
# token.pos_

import spacy
nlp = spacy.load('en')
doc = nlp(u"Mary slapped the green witch.")
for token in doc:
    print('{} - {}'.format(token, token.pos_))

Mary - PROPN
slapped - VERB
the - DET
green - ADJ
witch - NOUN
. - PUNCT


## Categorizing Spans: Chunking and Named Entity Recognition

*   Labeling a contiguous span of tokens, such as a noun phrase, is called chunking or shallow parsing.

### Example 2-5. Noun Phrase (NP) chunking

In [19]:
# noun_chunks

import spacy
nlp = spacy.load('en')
doc  = nlp(u"Mary slapped the green witch.")
for chunk in doc.noun_chunks:
    print('{} - {}'.format(chunk, chunk.label_))

Mary - NP
the green witch - NP


## Structure of Sentences

*   Shallow parsing identifies phrasal units; the task of identifying the relationships between those units is called ***parsing***.
*   Parse trees indicate how different grammatical units in a sentence are related hierarchically.



## Word Senses and Semantics

*   The different meanings of a word are called its senses. 
*   Automatic discovery of word senses from text was actually the first place semi-supervised learning was applied to NLP. 

