# Text Analysis using Python

Project Summary: Language Translation and Text Analysis

The project focuses on developing a language translation and text analysis tool using Python. It encompasses several functions and components to perform various operations on textual data. Here is a summary of the key functionalities:

1. Text Processing:
   - The project includes functions to read text from files, list the text files in a directory, and remove punctuation.
   - The `read_file` function reads the contents of a file and returns them as a string.
   - The `list_textfiles` function returns a list of filenames ending in '.txt' in a specified directory.
   - The `remove_punc2` function removes punctuation characters from a given text and returns the cleaned text.

2. Language Translation:
   - The translation functionality converts English words to Pig Latin.
   - The `translate` function uses a rule-based approach: it finds the first vowel in the word, moves the consonants before it to the end, and appends the suffix 'ay'.

3. Text Analysis:
   - The project includes functions for analyzing and manipulating text data.
   - The `split_sentences` function splits a text string into a list of sentences at the sentence-ending markers '.', '?', and '!'.
   - A short loop then strips each sentence, removes its punctuation, lowercases it, and splits it into a word list.
   - The `starts_with_vowel` function checks if a given string starts with a vowel.

4. Social Network Analysis:
   - The code reads follower-followee relationships from a file, first into a list of pairs (`edges`) and then into a dictionary (`edge_dict`).
   - The `following` function returns the users that a given user is following by scanning the `edges` list.
   - The `following2` function returns the same list with a single lookup in `edge_dict`.

5. Performance Evaluation:
   - The project uses the IPython `%timeit` magic command to compare the execution time of `following` (list scan) and `following2` (dictionary lookup).

Overall, the project combines text processing, text analysis, social network analysis, and a small translation component. It provides tools to process text data, look up relationships in a social network, and translate words from English to Pig Latin.

In [2]:
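# Read the excerpt and count substring occurrences
with open('/Users/yogeshgupta/Downloads/austen-emma-excerpt.txt') as infile:
    print(infile)        # shows the file object (name, mode, encoding)
    text = infile.read()
print(text.count("e"))   # number of times the letter 'e' appears
print(text.count('an'))  # counts 'an' anywhere, including inside words such as 'and'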

<_io.TextIOWrapper name='/Users/yogeshgupta/Downloads/austen-emma-excerpt.txt' mode='r' encoding='UTF-8'>
78
12


In [3]:
# Print the text, then count the occurrences of the letter 'e' by looping over it.
# Each x is a single character, so x.count('e') adds 1 whenever x is 'e'.
nE = 0
print(text)
for x in text:
    if 'e' in x:
        nE = nE + x.count('e')

print(nE)

Emma by Jane Austen 1816

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

78


In [5]:
print(text.count(' an '))

2
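
The two counts above differ because `str.count` matches substrings: 'an' is also found inside words such as 'and', while ' an ' requires a space on both sides and therefore misses a capitalized or sentence-final 'an'. A minimal sketch of whole-word counting with a regular expression follows; the helper name `count_word` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Hypothetical sample sentence, used only for illustration.
sample = "An excellent woman and an indistinct remembrance; more than an hour."
print(sample.count("an"))        # substring matches, including inside other words
print(sample.count(" an "))      # misses the capitalized "An" at the start
print(count_word(sample, "an"))  # whole words only, any capitalization
```

The word boundary `\b` keeps 'woman' and 'than' out of the count without relying on surrounding spaces.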


In [7]:
# Count the occurrences of the letter 'e' in `text` by comparing each character with
# `item_to_count`; the running total is kept in `counts` and printed at the end.
counts = 0
item_to_count = 'e'
for txt in text:
    if txt == item_to_count:
        counts = counts + 1

print(counts)

78


In [10]:
# The `remove_punc2` function removes punctuation characters from the `text` string and returns the cleaned version.
def remove_punc2(text):
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>>,.?/~`»¿'
    clean_text = ""
    for character in text:
        if character not in punctuation:
            clean_text += character
    return clean_text

# Word-count draft (left commented out):
#short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
#words = remove_punc2(text)
#count = {}
#textSplit = words.split()
#for x in textSplit:
#    if x in count:   # test the dictionary, not the text, to avoid a KeyError
#        count[x] = count[x] + 1
#    else:
#        count[x] = 1
#print(count)
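
The punctuation removal and the word-count idea from the cell above can also be sketched with the standard library. This is one possible variant, not the notebook's own method: the names `remove_punctuation` and `word_frequencies` are assumptions, and `string.punctuation` covers ASCII punctuation only, so it does not include characters such as '»' or '¿' from the hand-written list.

```python
import string
from collections import Counter

def remove_punctuation(text):
    """Drop every ASCII punctuation character listed in string.punctuation."""
    return text.translate(str.maketrans("", "", string.punctuation))

def word_frequencies(text):
    """Map each lowercase word in `text` to the number of times it occurs."""
    words = remove_punctuation(text).lower().split()
    return Counter(words)

short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
print(remove_punctuation(short_text))
print(word_frequencies(short_text).most_common(3))
```

`Counter` handles the "first time seen" case itself, which is the step the commented-out draft has to spell out with an if/else.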

In [12]:
for element in enumerate("Python"):
    print(element)

(0, 'P')
(1, 'y')
(2, 't')
(3, 'h')
(4, 'o')
(5, 'n')


In [14]:
for index, character in enumerate("Python"):
    print(index)

0
1
2
3
4
5


In [19]:
# Read one file and print its contents, then list every text file in a directory
# and print each file path together with the number of characters it contains.
import os

def read_file(filename):
    "Read the contents of FILENAME and return as a string."
    # the with-block closes the file even if reading fails;
    # pass encoding='utf-8' to open() if the platform default differs
    with open(filename) as infile:
        return infile.read()

fileContent = read_file("/Users/yogeshgupta/Downloads/textanalysis/austen-emma-excerpt.txt")
print(fileContent)

def list_textfiles(directory):
    "Return a list of filenames ending in '.txt' in DIRECTORY."
    textfiles = []
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            textfiles.append(os.path.join(directory, filename))
    return textfiles

for filepath in list_textfiles("/Users/yogeshgupta/Downloads/textanalysis"):
    text = read_file(filepath)
    print(filepath + " has " + str(len(text)) + " characters.")

Emma by Jane Austen 1816

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

/Users/yogeshgupta/Downloads/textanalysis/austen-emma-excerpt.txt has 703 characters.
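
The same "list the .txt files and measure each one" step can be sketched with `pathlib`. This is an alternative under stated assumptions, not the notebook's code: the function name `text_file_lengths` is hypothetical, and the demonstration runs on a temporary directory so that it does not depend on the paths used above.

```python
import tempfile
from pathlib import Path

def text_file_lengths(directory):
    """Return {filename: character count} for every .txt file in `directory`."""
    lengths = {}
    for path in sorted(Path(directory).glob("*.txt")):
        lengths[path.name] = len(path.read_text(encoding="utf-8"))
    return lengths

# Demonstrate on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("hello", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    demo_lengths = text_file_lengths(tmp)
    print(demo_lengths)
```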


In [21]:
def end_of_sentence_marker(character):
    "Return True if CHARACTER is '?', '!' or '.', False otherwise."
    return character in ('?', '!', '.')

In [23]:
print(end_of_sentence_marker('?'))

True


In [25]:
print(end_of_sentence_marker("a"))

False


In [31]:
# split_sentences takes a text string and splits it into a list of sentences.
# Text after the last end marker is dropped, and sentences keep their leading whitespace.
def split_sentences(text):
    "Split a text string into a list of sentences."
    sentences = []
    start = 0
    for end, character in enumerate(text):
        if end_of_sentence_marker(character):
            sentence = text[start: end + 1]
            sentences.append(sentence)
            start = end + 1
    return sentences

In [33]:
splitedSentences = split_sentences("This is a sentence. Should we separate it from this one?")

In [35]:
# Process each sentence in splitedSentences: strip leading and trailing whitespace, remove punctuation,
# convert to lowercase, split into words, and print the resulting word list.
for sent in splitedSentences:
    sent = sent.strip()
    cleanText = remove_punc2(sent)
    lowerSent = cleanText.lower()
    wordList = lowerSent.split()
    print(wordList)

['this', 'is', 'a', 'sentence']
['should', 'we', 'separate', 'it', 'from', 'this', 'one']
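
The sentence splitting and word extraction above can also be sketched with regular expressions. The version below is an assumption-laden alternative rather than the notebook's implementation: the names `split_sentences_keep_tail` and `sentence_words` are hypothetical, and unlike `split_sentences` it keeps trailing text that has no end marker.

```python
import re

def split_sentences_keep_tail(text):
    """Split on '.', '!' or '?', keeping trailing text that has no end marker."""
    pieces = re.findall(r"[^.!?]+[.!?]?", text)
    return [piece.strip() for piece in pieces if piece.strip()]

def sentence_words(sentence):
    """Lowercase a sentence and return its words without surrounding punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

for sentence in split_sentences_keep_tail("This is a sentence. Is this one? And a tail"):
    print(sentence_words(sentence))
```

Neither version handles abbreviations such as 'Mr.' or decimal numbers, which would be split at the period.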


In [36]:
# To remove a file's extension, use os.path.splitext.
# To drop the directory and keep only the file name, use os.path.basename.
# Combining the two functions gives the bare file name.
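
A minimal sketch of combining the two functions named in the comments above; the helper name `bare_name` and the relative example path are assumptions made for illustration.

```python
import os.path

def bare_name(filepath):
    """Return the file name without its directory or final extension."""
    filename = os.path.basename(filepath)           # drop the directory part
    stem, _extension = os.path.splitext(filename)   # drop the last extension only
    return stem

print(bare_name("textanalysis/austen-emma-excerpt.txt"))
```

Note that `os.path.splitext` removes only the last extension, so a name such as 'archive.tar.gz' becomes 'archive.tar'.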

In [39]:
# Read a file of follower and followee names, one 'follower;followee' pair per line,
# and store the pairs in a list named edges. Then print the first 10 pairs.

edges = []
with open('/Users/yogeshgupta/Downloads/textanalysis/twitterName.txt') as infile:
    for line in infile:
        follower, followee = line.strip().split(';')
        edges.append((follower, followee))
print(edges[:10])

[('@Fox', '@Judie'), ('@Tristan', '@Jermain'), ('@Allyn', '@Winfred'), ('@Dennis', '@Randolph'), ('@Wallie', '@Venkat'), ('@Fo', '@Judi'), ('@Trista', '@Jermai'), ('@lyn', '@Winfre'), ('@ennis', '@Randolp'), ('@llie', '@Venka')]


In [41]:
# Define a function `following` that returns the users a given user is following, based on
# the follower-followee pairs in the edges list, then print the followees of "@Fox".
def following(user, edges):
    "Return a list of all users USER is following."
    followees = []
    for follower, followee in edges:
        if follower == user:
            followees.append(followee)
    return followees

print(following("@Fox", edges)) # "@Fox" has 6 followee entries; "@Judie" is listed twice

['@Judie', '@Judie', '@Jermain', '@Winfred', '@Randolph', '@Venkat']
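
The output above lists '@Judie' twice because the pair occurs more than once in the input. If each followee should appear only once, one option is to deduplicate while keeping the first-seen order. The function name `distinct_following` and the small `sample_edges` list are hypothetical and only reuse names from the output above.

```python
def distinct_following(user, edges):
    """Return the users `user` follows, each listed once, in first-seen order."""
    followees = [followee for follower, followee in edges if follower == user]
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(followees))

# Hypothetical sample data for illustration.
sample_edges = [("@Fox", "@Judie"), ("@Tristan", "@Jermain"),
                ("@Fox", "@Judie"), ("@Fox", "@Winfred")]
print(distinct_following("@Fox", sample_edges))
```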


In [43]:
%timeit following("@Fox", edges)

1.17 µs ± 25 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [46]:
edge_dict = {}
for line in open("/Users/yogeshgupta/Downloads/textanalysis/twitterName.txt"):
    name_a, name_b = line.strip().split(';')
    if name_a in edge_dict:
        edge_dict[name_a].append(name_b)
    else:
        edge_dict[name_a] = [name_b] # key: follower, value: list of that follower's followees

In [48]:
edge_dict

{'@Fox': ['@Judie', '@Judie', '@Jermain', '@Winfred', '@Randolph', '@Venkat'],
 '@Tristan': ['@Jermain'],
 '@Allyn': ['@Winfred'],
 '@Dennis': ['@Randolph'],
 '@Wallie': ['@Venkat'],
 '@Fo': ['@Judi'],
 '@Trista': ['@Jermai'],
 '@lyn': ['@Winfre'],
 '@ennis': ['@Randolp', '@Raolph'],
 '@llie': ['@Venka'],
 '@ox': ['@Jud'],
 '@ristan': ['@Jerma'],
 '@llyn': ['@Winfr'],
 '@allie': ['@nkat'],
 '@Fx': ['@Jie'],
 '@Trstan': ['@rmain'],
 '@Alln': ['@nfred'],
 '@Denns': ['@ndolph'],
 '@Walli': ['@enkat']}

In [50]:
def following2(user, edges):
    "Return the list of users USER is following; EDGES is a dict such as edge_dict."
    return edges[user]

%timeit following2("@Fox", edge_dict)

87 ns ± 0.152 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
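
The dictionary-building loop and the lookup can be sketched more compactly with `collections.defaultdict`. This is a variant under assumptions, not the notebook's code: `build_edge_dict` and `following_safe` are hypothetical names, the input is any iterable of 'follower;followee' lines, and unknown users return an empty list instead of raising a `KeyError` as `following2` would.

```python
from collections import defaultdict

def build_edge_dict(lines):
    """Build {follower: [followee, ...]} from 'follower;followee' lines."""
    edge_dict = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        follower, followee = line.split(";")
        edge_dict[follower].append(followee)
    return dict(edge_dict)

def following_safe(user, edge_dict):
    """Like following2, but return an empty list for unknown users."""
    return edge_dict.get(user, [])

# Hypothetical sample lines for illustration.
demo = build_edge_dict(["@Fox;@Judie\n", "@Fox;@Winfred\n", "\n", "@Allyn;@Winfred"])
print(demo)
print(following_safe("@Nobody", demo))
```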


In [53]:
edges = []
for line in open("/Users/yogeshgupta/Downloads/textanalysis/twitterName.txt"):
    name_a, name_b = line.strip().split(';')
    # repeatedly add edges to the network (1000 times)
    for i in range(1000):
        edges.append((name_a, name_b))

In [72]:
%timeit following("@Fox", edges)

981 µs ± 7.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


The next cell rebuilds the dictionary `edge_dict` from the same file, again adding every relationship 1000 times, and measures the execution time of `following2` for the user "@Fox". On the enlarged network the list scan in `following` slows from about 1.17 µs to 981 µs per call, while the dictionary lookup stays at about 87 ns.

In [58]:
edge_dict = {}
for line in open("/Users/yogeshgupta/Downloads/textanalysis/twitterName.txt"):
    name_a, name_b = line.strip().split(';')
    for i in range(1000):
        if name_a in edge_dict:
            edge_dict[name_a].append(name_b)
        else:
            edge_dict[name_a] = [name_b]

%timeit following2("@Fox", edge_dict)

87.2 ns ± 1.12 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
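
Outside IPython, where `%timeit` is not available, a similar comparison can be run with the standard `timeit` module. The sketch below uses a synthetic network (an assumption; it is not the twitterName.txt data) and compact re-definitions of `following` and `following2`. The measured times depend on the machine, so no expected numbers are given.

```python
import timeit

# Synthetic network (hypothetical data): 20 users, each following 1000 accounts.
edges = [(f"@user{u}", f"@followee{f}") for u in range(20) for f in range(1000)]
edge_dict = {}
for follower, followee in edges:
    edge_dict.setdefault(follower, []).append(followee)

def following(user, edges):
    """Scan every pair: the work grows with the number of edges."""
    return [followee for follower, followee in edges if follower == user]

def following2(user, edge_dict):
    """One hash lookup: the work does not depend on the number of edges."""
    return edge_dict[user]

runs = 100
list_time = timeit.timeit(lambda: following("@user0", edges), number=runs)
dict_time = timeit.timeit(lambda: following2("@user0", edge_dict), number=runs)
print(f"list scan:   {list_time / runs:.2e} s per call")
print(f"dict lookup: {dict_time / runs:.2e} s per call")
```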


The code converts an English word to Pig Latin by moving the consonants that precede the first vowel to the end of the word and adding the suffix 'ay'.

In [60]:
# English to Pig Latin
# The loop breaks at the first occurrence of a vowel
def translate(word):
    "Convert a word to Pig Latin."
    vowels = 'aeiouAEIOU'
    # default for words without any vowel: nothing stays in front
    start = len(word)
    end = ''
    # loop over all characters in word
    for i, char in enumerate(word):
        # if this character is not a vowel
        if char not in vowels:
            # it is a consonant, so add it to the end.
            end += char
        # if it is a vowel
        else:
            # we set the starting position to 
            # the position of this character
            start = i
            break
    return word[start:] + end + 'ay'

translate('Practice')

'acticePray'
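
The same rule can be applied word by word to a whole sentence. The sketch below is illustrative only: `pig_latin_word` is a compact restatement of the rule described above, `pig_latin_sentence` is a hypothetical helper, and giving vowel-less words just the suffix is an assumption about the desired behavior.

```python
def pig_latin_word(word):
    """Move the consonants before the first vowel to the end and add 'ay'."""
    vowels = "aeiouAEIOU"
    for i, char in enumerate(word):
        if char in vowels:
            return word[i:] + word[:i] + "ay"
    return word + "ay"  # assumption: words without a vowel just get the suffix

def pig_latin_sentence(sentence):
    """Translate every whitespace-separated word of `sentence`."""
    return " ".join(pig_latin_word(word) for word in sentence.split())

print(pig_latin_sentence("Practice makes perfect"))
```

Punctuation and capitalization are left untouched here, so 'Practice' becomes 'acticePray' exactly as in the cell above.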

In [63]:
# Method 1
# Check whether a string starts with a vowel by testing whether its first character is in a string of vowels.
def starts_with_vowel(strings):
    vowels = 'aeiouAEIOU'
    return strings[0] in vowels

starts_with_vowel('Amazing')
starts_with_vowel('Jack')  # only this last result is displayed

False

In [66]:
#Method 2
def starts_with_vowel(word):
    "Return True if WORD starts with a vowel, False otherwise."
    vowels = ('a', 'e', 'i', 'o', 'u', 'A', 'E', 'I', 'O', 'U')
    return word.startswith(vowels)

In [68]:
def add_suffix(word,suffix):
    word_suffix = word + suffix
    return word_suffix

add_suffix('luck','ily')

'luckily'

In [73]:
word = 'quick' 
add_suffix(word,'ly')

'quickly'

The code translates a word by rotating its leading letters to the end and then adding a suffix: if the word starts with a vowel, the suffix is appended; otherwise the first letter is moved to the end and `translate` calls itself recursively on the result. This definition replaces the earlier one-argument `translate`.

In [71]:
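def translate(word, suffix):
    # stop when the word starts with a vowel; a word with no vowel at all
    # also just gets the suffix, since rotating it would never end
    no_vowel = not any(starts_with_vowel(char) for char in word)
    if no_vowel or starts_with_vowel(word):
        return add_suffix(word, suffix)
    return translate(word[1:] + word[0], suffix)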

translate('JkcEEEE','Amazing')

'EEEEJkcAmazing'
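
The rotation can also be written as a loop instead of a recursion. The sketch below is a hedged alternative, not the notebook's method: `translate_iterative` is a hypothetical name, and the loop is capped at one full rotation so that words without vowels, and the empty string, return unchanged plus the suffix.

```python
def translate_iterative(word, suffix):
    """Rotate leading non-vowels to the end, then append `suffix`."""
    vowels = "aeiouAEIOU"
    # At most len(word) rotations: after that the word is back where it started.
    for _ in range(len(word)):
        if word[0] in vowels:
            break
        word = word[1:] + word[0]
    return word + suffix

print(translate_iterative("JkcEEEE", "Amazing"))
```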