### NLP

### Some important points

The task here is to convert number words to numbers.
I have divided the task into parts so that the pieces can be reused for other purposes and extended or modified easily if needed.
1. The first function takes only number words as input and converts them to an integer. It does not handle decimals and raises an error if the input contains any word that is not a number word.
2. The second function handles the decimal part.
3. The third function combines the first and second functions. It takes any number words (including decimals) and converts them to the corresponding numeric value.
4. The fourth function identifies the number words in the given input. It returns the start and end indices of each run of number words, from which the words themselves can easily be extracted.
5. The final function uses all of these. It takes any sentence as input, uses the fourth function to identify the number-word parts, applies the third function to each part, and returns the sentence with those parts converted.

Some of the important cases handled:
1. The word "and" can occur between number words, e.g. "One hundred and ten".
2. The input may contain number words in capital letters.
3. The number words may be followed by punctuation, e.g. "... sixty six: ...", "... One million twenty; ...".

Some other basic cases are handled as well.

Some cases cannot be handled without knowing the context. For those we would need to train a model on some data.<br>
**For example**-<br>
Sentence 1-  "I have one **hundred and eighty** rupees."<br>
Sentence 2-  "Velocity of cars are **hundred and eighty** kms per hour respectively"
<br>The two should not produce the same output for the number words: sentence 1 means 180, while "respectively" in sentence 2 suggests two separate values, 100 and 80. Telling them apart requires understanding the context.<br>
**Another example**- "You are the chosen **one**". Here we should not convert "one" to 1.

### Lists of number words

In [1]:
units = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight","nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
"sixteen", "seventeen", "eighteen", "nineteen"]

tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

scales = ["hundred", "thousand", "million", "billion", "trillion"]

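number_system = {}
for index, word in enumerate(units):
    number_system[word] = (1, index)
for index, word in enumerate(tens):
    if word:  # skip the empty placeholders so "" never counts as a number word
        number_system[word] = (1, index * 10)
for index, word in enumerate(scales):
    # index * 3 or 2 gives exponent 2 for "hundred" (index 0) and 3, 6, 9, 12 for the rest
    number_system[word] = (10 ** (index * 3 or 2), 0)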
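The exponent expression `index * 3 or 2` is easy to misread, so here is a small standalone sketch of what it evaluates to. The name `scale_values` is introduced only for this illustration and is not part of the notebook.

```python
scales = ["hundred", "thousand", "million", "billion", "trillion"]

# For index 0, index * 3 is 0, which is falsy, so `or` falls back to 2.
# For every other index the product is non-zero and is used directly.
scale_values = {word: 10 ** (index * 3 or 2) for index, word in enumerate(scales)}

for word, value in scale_values.items():
    print(word, value)
```

This is why "hundred" maps to 10^2 while the remaining scale words step up by a factor of 1000 each.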

### Function that converts number words to an integer value

In [2]:
def text_to_integer(num_text):
    current = result = 0
    for word in num_text.split():
        if word not in number_system:
            raise ValueError("Invalid number word: " + word)

        scale, increment = number_system[word]
        current = current * scale + increment
        if scale > 100:
            result += current
            current = 0

    return result + current
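To see how `current` and `result` evolve, the following sketch replays the same accumulation rule on one phrase and records each step. It uses a trimmed, hypothetical lookup table (`demo_system`) and a helper name (`trace_text_to_integer`) that exist only for this illustration.

```python
# Hypothetical, trimmed lookup table: word -> (scale, increment)
demo_system = {
    "two": (1, 2), "three": (1, 3), "five": (1, 5),
    "hundred": (100, 0), "thousand": (1000, 0),
}

def trace_text_to_integer(num_text):
    """Apply the same rule as text_to_integer and record every step."""
    current = result = 0
    steps = []
    for word in num_text.split():
        scale, increment = demo_system[word]
        current = current * scale + increment
        if scale > 100:          # thousand and above close the current group
            result += current
            current = 0
        steps.append((word, current, result))
    return result + current, steps

total, steps = trace_text_to_integer("two hundred thousand three hundred five")
for word, current, result in steps:
    print(word, current, result)
print(total)  # 200305
```

Note that "hundred" multiplies the running group but does not close it, which is what lets "two hundred thousand" become 200000 before being added to `result`.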

### Function that converts decimal digit words to a decimal value

In [3]:
decimal_words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine']

def get_decimal_number(decimal_digit_words):
    # Expects a list of digit words, e.g. ['two', 'one'] gives 0.21
    decimal_dic = {word: idx for idx, word in enumerate(decimal_words)}
    digits = []
    for dec_word in decimal_digit_words:
        if dec_word not in decimal_dic:
            return 0  # any non-digit word invalidates the fractional part
        digits.append(str(decimal_dic[dec_word]))
    return float('0.' + ''.join(digits))
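One thing to watch with this function is the input type: it iterates over its argument, so it needs a list of words rather than a single string. The sketch below uses a compact stand-in (`fraction_from_words`, a hypothetical name) that follows the same rules, to show the difference.

```python
DIGIT_WORDS = ['zero', 'one', 'two', 'three', 'four',
               'five', 'six', 'seven', 'eight', 'nine']

def fraction_from_words(words):
    """Hypothetical compact variant: ['two', 'one'] -> 0.21, anything invalid -> 0."""
    lookup = {word: str(idx) for idx, word in enumerate(DIGIT_WORDS)}
    if any(word not in lookup for word in words):
        return 0
    return float("0." + "".join(lookup[word] for word in words))

print(fraction_from_words("two one".split()))   # 0.21
print(fraction_from_words("two one"))           # 0, a string is iterated letter by letter
print(fraction_from_words(["two", "eleven"]))   # 0, "eleven" is not a single digit
```

The digits are concatenated rather than summed because "point two one" names the digit sequence 2, 1 after the decimal point.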


In [4]:
extra_words=["and","point"]  ## List of words which we want to allow in between the number words

### Convert the number words to numbers
#### Combining previous two functions

In [5]:
def text_to_num(number_sentence):
    number_sentence = number_sentence.lower()  # accept number words in capital letters
    if number_sentence.isdigit():  # already a digit string such as "42" (not strictly needed here)
        return int(number_sentence)
    # Everything before "point" is the integer part, everything after it is the decimal digits
    integer_part, _, decimal_part = number_sentence.partition("point")
    total_sum = text_to_integer(integer_part)
    decimal_digits = decimal_part.split()
    if decimal_digits:
        total_sum += get_decimal_number(decimal_digits)
    return total_sum
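The split into integer and decimal parts relies on `str.partition`, which always returns three pieces, whether or not the separator is present. A minimal sketch (the helper name `split_on_point` is made up for this example):

```python
def split_on_point(phrase):
    """Return (integer words, decimal digit words) for a number phrase."""
    integer_part, _separator, decimal_part = phrase.partition("point")
    return integer_part.split(), decimal_part.split()

print(split_on_point("forty five point five"))  # (['forty', 'five'], ['five'])
print(split_on_point("forty five"))             # (['forty', 'five'], [])
```

When "point" is absent, the third piece is an empty string, so the decimal word list is empty and only the integer conversion runs.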

### Extract the indices of the parts of the sentence that need to be converted to numbers

In [6]:
def extract_numeric_word_index(sentence):
    words = sentence.lower().split(" ")
    counter = 0       # length of the current run, connectors included
    start_index = -1  # -1 means we are not inside a run of number words
    start_index_list = []
    end_index_list = []
    for i, word in enumerate(words):
        if word in number_system:
            if counter == 0:
                start_index = i
                start_index_list.append(start_index)
            counter += 1
        elif start_index != -1:
            # A connector ("and", "point") continues the run only if a number word follows it.
            # The bounds check avoids an IndexError when the connector is the last word.
            next_is_number = i + 1 < len(words) and words[i + 1] in number_system
            if word in extra_words and next_is_number:
                counter += 1
            else:
                end_index_list.append(start_index + counter - 1)
                start_index = -1
                counter = 0
    if counter != 0:  # the sentence ended while still inside a run
        end_index_list.append(start_index + counter - 1)
    return start_index_list, end_index_list
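The two returned lists pair up position by position, and both ends are inclusive. The sketch below shows how the phrases can be sliced out. The index lists are hard-coded so the block runs on its own; they are the values I would expect `extract_numeric_word_index` to return for this sentence.

```python
sentence = "I have one hundred and ten apples and two pears"
words = sentence.split(" ")

# Assumed return value for this sentence (hard-coded for the illustration).
start_index_list, end_index_list = [2, 8], [5, 8]

# end is inclusive, hence the + 1 when slicing
phrases = [" ".join(words[start:end + 1])
           for start, end in zip(start_index_list, end_index_list)]
print(phrases)  # ['one hundred and ten', 'two']
```

The "and" inside the first run is kept because a number word follows it, while the "and" after "apples" is ignored because no run is open at that point.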

### Final function which takes any sentence and converts the necessary parts to numbers

In [7]:
import re

In [8]:
## Characters which can occur just at the end of a number word.
## We can add all other characters.
characters=["-",",",";","=",":","!"] 

In [9]:
def text_to_numeric(sentence):
    ## Detach characters like (,;:!=) that occur right after a number word.
    ## The '#' marks where a change was made so that it can be undone later.
    for ch in characters:
        sentence = re.sub(re.escape(ch), " " + ch + "#", sentence)
    sentence = re.sub(r"\.", " .#", sentence)
    words = sentence.split(" ")
    start_index_list, end_index_list = extract_numeric_word_index(sentence)
    converted_words = []
    previous_end = 0  # index of the first word not yet copied to the output
    for start, end in zip(start_index_list, end_index_list):
        ## "and" carries no numeric value, so drop it before conversion
        word_list = [w for w in words[start:end + 1] if w.lower() != "and"]
        number = text_to_num(' '.join(word_list))
        converted_words += words[previous_end:start] + [str(number)]
        previous_end = end + 1
    converted_words += words[previous_end:]  # also covers sentences without number words
    final = ' '.join(converted_words)
    for ch in characters:
        final = re.sub(" " + re.escape(ch) + "#", ch, final)  ## undo the earlier changes
    final = re.sub(r" \.#", ".", final)
    return final
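The punctuation handling is a detach and re-attach round trip: each listed character is moved into its own token and tagged with `#`, and the tag is removed again after conversion. A standalone sketch of just that step, with hypothetical helper names `detach` and `reattach`:

```python
import re

# The same characters as in the list above, plus "." which is handled separately there.
trailing = ["-", ",", ";", "=", ":", "!", "."]

def detach(sentence):
    """Move each listed character into its own '#'-tagged token."""
    for ch in trailing:
        sentence = re.sub(re.escape(ch), " " + ch + "#", sentence)
    return sentence

def reattach(sentence):
    """Undo detach by gluing tagged characters back onto the previous word."""
    for ch in trailing:
        sentence = re.sub(" " + re.escape(ch) + "#", ch, sentence)
    return sentence

marked = detach("sixty six: done.")
print(marked)             # sixty six :# done .#
print(marked.split(" "))  # ['sixty', 'six', ':#', 'done', '.#']
print(reattach(marked))   # sixty six: done.
```

After detaching, "six" is a clean token that the number lookup can match. One limitation of this scheme, as far as I can tell, is that a sentence already containing a space followed by one of these characters and `#` would be altered by the re-attach step.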

### Check how the functions work

In [21]:
text_to_integer("fifty nine") # Only converts to integer

59

In [23]:
get_decimal_number("five six".split())  # expects a list of digit words, not a single string

0.56

In [10]:
text_1 = "A car starts from rest and accelerates uniformly over a time of five point two one seconds \
for a distance of one hundred and ten meters. Determine the acceleration of the car."
text_2 = "I don't have three point two three eight grams of sugar but I have one billion and two hundred and \
two point two zero ruppes"
text_3= "We together have one million and one thousand and thirty two. We want to split it \
in thirty two parts or forty five point five parts."

#### Decimal and "and" case 

#### Some other examples

In [11]:
text_to_numeric(text_1)

'A car starts from rest and accelerates uniformly over a time of 5.21 seconds for a distance of 110 meters. Determine the acceleration of the car.'

In [12]:
text_to_numeric(text_2)

"I don't have 3.238 grams of sugar but I have 1000000202.2 ruppes"

In [13]:
text_to_numeric(text_3)

'We together have 1001032. We want to split it in 32 parts or 45.5 parts.'

#### It also works for the third case mentioned above (number words followed by punctuation)

In [16]:
text_4="you won one million! Yeah! Here your two thousand bonus."

In [17]:
text_to_numeric(text_4)

'you won 1000000! Yeah! Here your 2000 bonus.'