This program opens a .txt file containing unstructured text and extracts all computer-science-related terms and their definitions into a database. It started as an English assignment for a friend of mine who is a computer scientist: he had to learn all these definitions by heart and then recite them during a test. The file is large and written in plain text, so it was hard for him to navigate it quickly enough to cheat during the test. That is why I wrote this program, which reads through the file, automatically detects the terms and definitions, and stores them in a database where the needed words can be found quickly. The file turned out to be unstructured, so detecting terms and definitions was not as easy as I had thought: the teacher had made some typos, and there were no distinct boundaries between definitions. So I had to come up with an algorithm that would extract, clean and load the data into a database.

IMPORTING LIBRARIES, CREATING DATABASE AND CONNECTING TO IT

In [1]:
import sqlite3
import re

connection = sqlite3.connect('dictionary.sqlite')
cur = connection.cursor()

definition = None
id_count = 0
true_count = 0
false_count = 0
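# Regex fragments that mark the boundary between a term and its definition
# (raw strings keep the backslashes intact and avoid invalid-escape warnings)
keys = ('abr', 'adr', 'ajd', 'avd', 'arb', 'ard', 'n', 'v', 'adj', 'adv', r'\(n\)', r'\(.+\)', r'a\s|\san\s|\sthe')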

cur.executescript('''DROP TABLE IF EXISTS Dictionary;
	CREATE TABLE Dictionary(
	word TEXT NOT NULL PRIMARY KEY UNIQUE, definition TEXT, category TEXT)''')

<sqlite3.Cursor at 0x1ab09e46dc0>
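
Because `word` is declared as the primary key, inserting the same term twice raises `sqlite3.IntegrityError`. The sketch below uses an in-memory database and made-up rows rather than the real glossary, and shows one possible way to tolerate duplicates with `INSERT OR IGNORE`. Whether a repeated term should be ignored or overwritten is a design choice the notebook does not make; the names `demo_conn`, `demo_cur` and `rows` are hypothetical.

```python
import sqlite3

# In-memory database so the sketch does not touch dictionary.sqlite
demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()
demo_cur.execute('''CREATE TABLE Dictionary(
    word TEXT NOT NULL PRIMARY KEY, definition TEXT, category TEXT)''')

# Made-up rows; the second one repeats the primary key
rows = [
    ('Bit', 'A binary digit', 'noun'),
    ('Bit', 'A duplicate entry', 'noun'),
    ('Byte', 'A group of eight bits', 'noun'),
]

# A plain INSERT would raise sqlite3.IntegrityError on the repeated key;
# INSERT OR IGNORE silently keeps the first row instead
demo_cur.executemany(
    'INSERT OR IGNORE INTO Dictionary(word, definition, category) VALUES (?, ?, ?)',
    rows)
demo_conn.commit()

stored = demo_cur.execute(
    'SELECT word, definition FROM Dictionary ORDER BY word').fetchall()
print(stored)
```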

OPENING RAW .TXT FILE FROM WHICH WE WILL EXTRACT OUR WORDS AND DEFINITIONS

In [2]:
filename = 'glossary.txt'
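try:
	file = open(filename, 'r')
	print('File %s was opened successfully\n' % filename)
except FileNotFoundError:
	# Catch only the missing-file case instead of every possible exception
	print('ERROR: File %s was not found\n' % filename)
	raise SystemExit(1)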

File glossary.txt was opened successfully
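
The cell above leaves the file handle open for as long as the kernel runs. A minimal sketch of the same step with a context manager follows, assuming the glossary is UTF-8 encoded (the notebook does not state the encoding). The helper name `read_glossary_lines` and the temporary sample file are made up for illustration.

```python
import os
import tempfile

def read_glossary_lines(path, encoding='utf-8'):
    """Return the non-empty lines of a text file with trailing whitespace removed.

    Hypothetical helper; the default encoding is an assumption.
    """
    # The with-block closes the file even if reading fails part-way
    with open(path, 'r', encoding=encoding) as handle:
        return [line.rstrip() for line in handle if line.strip()]

# Demonstration on a throwaway file instead of the real glossary.txt
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('bit n a binary digit\n\nbyte n a group of eight bits\n')
    sample_path = tmp.name

sample_lines = read_glossary_lines(sample_path)
os.remove(sample_path)
print(sample_lines)
```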



AUTOMATICALLY LOCATING AND EXTRACTING WORDS AND THEIR DEFINITIONS FROM A .TXT FILE, CLEANING AND LOADING THEM INTO THE DATABASE

In [3]:
#SPLITTING TEXT INTO WORDS 
for line in file:
	line = line.rstrip()
	line_work = line.replace('.', ' ').rstrip().lower()
	words = line_work.split()
	if len(words) < 2: continue
	id_count += 1

	presence = False
	for key in keys:
		query = r'\s%s\s' % key
		match = re.findall(query, line_work)
		if len(match) > 0:
			true_count += 1
			presence = True
			identifier = line_work.index(match[0])
			break

	if not presence:
		print('\nERROR: COULD NOT INTERPRET: ', line_work.strip())
		false_count += 1
		# Skip the line: without a marker there is no split position to use
		continue

    #AUTOMATICALLY DETECTING WORDS AND THEIR DEFINITIONS
	term = line[:identifier].strip().replace('.', ' ')
	categories = [
		(' n ', 'noun'), (' (n) ', 'noun'),
		(' v ', 'verb'), (' (v) ', 'verb'),
		(' adj ', 'adjective'),
		(' abr ', 'abbreviation'), (' arb ', 'abbreviation'), (' adr ', 'abbreviation'),
	]
	# Use the first marker found in the line; category stays None if there is none
	category = None
	for symbol, meaning in categories:
		if symbol in line_work:
			category = meaning
			line_work = line_work.replace(symbol, '')
			break
	definition = line_work[identifier:].strip().capitalize()
    
    #DETECTING MISTAKES AND CLEANING THE DATA 
	if definition.startswith('('):
		abr_pos = definition.index(')')
		abr = definition[:abr_pos + 1]
		definition = definition[abr_pos + 1:].strip().capitalize()
		term = term + ' ' + abr.upper()

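	# A term ending in '1' whose definition mentions '2' is a numbered entry
	# (endswith and slicing also work when the term is empty)
	if term.endswith('1') and '2' in definition:
		term = term[:-1]
		definition = '1. ' + definition

	term = term[:1].upper() + term[1:]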
	definition = definition.replace('  ', '. ')
    
    #INSERTING CLEANED WORDS AND DEFINITIONS INTO A DATABASE
	cur.execute('''INSERT INTO Dictionary(word, definition, category) VALUES (?, ?, ?)''', (term, definition, category))

# Commit once after the loop instead of once per row, then release the file
connection.commit()
file.close()

print('-----------------\n  RESULTS:\n  Total extracted words: %i\n-----------------\n' % id_count)
print('========================================= DICTIONARY 3000 ==============================================\n')
print('Dictionary %s is ready to be used.' % filename)

-----------------
  RESULTS:
  Total extracted words: 612
-----------------


Dictionary glossary.txt is ready to be used.
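
The detection step can also be written as a pure function, which makes it easier to try on individual lines. The sketch below is a simplified variant, not the exact algorithm used in the cell above: it recognises only the markers from the `categories` list, collected in a hypothetical `MARKERS` table, and returns `None` for lines it cannot split. The function name `parse_line` and the sample lines are made up for illustration.

```python
import re

# Hypothetical marker table built from the categories list in the notebook
MARKERS = {
    'n': 'noun', '(n)': 'noun',
    'v': 'verb', '(v)': 'verb',
    'adj': 'adjective',
    'abr': 'abbreviation', 'arb': 'abbreviation', 'adr': 'abbreviation',
}

# One alternation of all markers, longest first, each escaped for regex use
_TAGS = '|'.join(re.escape(tag) for tag in sorted(MARKERS, key=len, reverse=True))
MARKER_RE = re.compile(r'\s(%s)\s' % _TAGS, re.IGNORECASE)

def parse_line(line):
    """Split 'term <marker> definition' into (term, category, definition).

    Returns None when no marker is found or either side is empty.
    """
    text = line.strip()
    match = MARKER_RE.search(text)
    if match is None:
        return None
    term = text[:match.start()].strip()
    definition = text[match.end():].strip()
    if not term or not definition:
        return None
    category = MARKERS[match.group(1).lower()]
    # Capitalise only the first character so acronyms keep their case
    return (term[0].upper() + term[1:],
            category,
            definition[0].upper() + definition[1:])

for sample in ('database n a file of structured data',
               'compile (v) to convert source code into machine code',
               'no marker here'):
    print(parse_line(sample))
```

Keeping the parsing separate from the database insert means unparseable lines can be collected and reviewed instead of being mixed into the table.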


CREATING A SIMPLE USER INTERFACE FOR OUR DICTIONARY

In [None]:
print('\nEnter the word related to programming and you will get a definition.\nYou can search for as many words as you want to.\nPlease enter one word at a time.\nYou can also type a part of the word and I will try to find it.')
while True:
	# ASKING USER FOR A WORD
	word = input('\nEnter the word: ').strip().lower()

	# SEARCHING AND RETURNING THE WORD
	# One query fetches everything needed, so no per-word lookup is required
	cur.execute('''SELECT word, definition, category FROM Dictionary ORDER BY word''')
	success = False
	count = 1
	print('----------------------------------------------------------------------------------\nEntered word: %s\n' % word)
	for found_word, found_definition, found_category in cur.fetchall():
		if word in found_word.lower():
			success = True
			if found_category is None:
				print('\n%i. %s: %s' % (count, found_word.upper(), found_definition))
			else:
				print('\n%i. %s (%s): %s' % (count, found_word.upper(), found_category, found_definition))
			count += 1
	if not success:
		print('\nI have not found that word. Please, try again')


Enter the word related to programming and you will get a definition.
You can search for as many words as you want to.
Please enter one word at a time.
You can also type a part of the word and I will try to find it.



Enter the word:  database


----------------------------------------------------------------------------------
Entered word: database


1. DATABASE (noun): A file of structured data

2. DATABASE PROGRAM (noun): An applications program used to store, organize and retrieve a large collection of data. among other facilities, data can be searched, sorted and updated

3. RELATIONAL DATABASE (noun): A database system that maintains separate, related files (tables), but combines data elements from the files for queries and reports



Enter the word:  pizza


----------------------------------------------------------------------------------
Entered word: pizza


I have not found that word. Please, try again



Enter the word:  code


----------------------------------------------------------------------------------
Entered word: code


1. ASCII CODE (noun): A standard system for the binary representation of characters. ascii, which stands for american standard code for information interchange, permits computers from different manufacturers to exchange data, aspect ratio width of the screen divided by its height, e g. 4:3 (standard pc monitor or tv set) and 16:9 (high-definition tv), assembler a special program that converts a program written in a low-level language into machine code, assembly language a low-level language that uses abbreviations, such as add, sub and mpy, to represent instructions

2. BAR CODE READER (noun): A specialized scanner used to read price labels in shops

3. BINARY CODE (noun): A code made of just two numbers (0 and 1)

4. MACHINE CODE (noun): Binary code numbers; the only language that computers can understand directly

5. SOURCE CODE (noun): 1. computer instructions written in a high-le


Enter the word:  python


----------------------------------------------------------------------------------
Entered word: python


1. PYTHON (noun): A popular high-level programming language
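
The lookup loop loads every word into Python and filters there. A possible alternative, sketched here against an in-memory table with a few rows borrowed from the output above, lets SQLite do the substring match with `LIKE`. The function name `search` is hypothetical. SQLite's `LIKE` is case-insensitive for ASCII characters by default, so it mirrors the `.lower()` comparison only for ASCII input.

```python
import sqlite3

def search(cursor, fragment):
    """Return (word, definition, category) rows whose word contains fragment."""
    # Escape LIKE wildcards so user input such as '%' is matched literally
    escaped = (fragment.replace('\\', '\\\\')
                       .replace('%', '\\%')
                       .replace('_', '\\_'))
    cursor.execute(
        "SELECT word, definition, category FROM Dictionary "
        "WHERE word LIKE ? ESCAPE '\\' ORDER BY word",
        ('%' + escaped + '%',))
    return cursor.fetchall()

# In-memory table so the sketch does not depend on dictionary.sqlite
demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()
demo_cur.execute(
    'CREATE TABLE Dictionary(word TEXT PRIMARY KEY, definition TEXT, category TEXT)')
demo_cur.executemany('INSERT INTO Dictionary VALUES (?, ?, ?)', [
    ('Database', 'A file of structured data', 'noun'),
    ('Relational database', 'A database system that maintains separate, related files', 'noun'),
    ('Python', 'A popular high-level programming language', 'noun'),
])

for word, definition, category in search(demo_cur, 'database'):
    print('%s (%s): %s' % (word.upper(), category, definition))
```

Passing the fragment as a bound parameter, as the notebook already does for its own queries, keeps user input out of the SQL text.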
