Let's explore the database using SQLite.

First, let's load the pandas and sqlite3 packages to support our data exploration. We will also load the os package to help us find where the files are located.

In [None]:
import sqlite3
import pandas as pd
import os

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Now let's create a connection to the SQLite database, plus a cursor to run our statements.

In [None]:
conn = sqlite3.connect('ubuntu_data.db')
c = conn.cursor()
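
While experimenting, it can help to try statements against a throwaway database first. The block below is a minimal sketch, assuming SQLite's special `':memory:'` path; the names `demo_conn` and `demo_cur` are hypothetical and are not used elsewhere in this notebook.

```python
import sqlite3

# ':memory:' creates a temporary database that lives only in RAM
demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()

# ask SQLite which version of the engine we are running
sqlite_version = demo_cur.execute('SELECT sqlite_version();').fetchone()[0]
print('SQLite version:', sqlite_version)

# a fresh database has no tables yet, so this list is empty
tables = demo_cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table';"
).fetchall()
print(tables)

demo_conn.close()
```

Nothing is written to disk here, so the sketch can be re-run as often as needed.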

Now let's create the SQL tables. For the purposes of this notebook, we create two tables. If you want to work with all the .csv files, just repeat the process for the third file.

In [None]:
c.execute('''
CREATE TABLE dialogs (
    folder INTEGER,
    dialogueID TEXT,
    date TEXT,
    sender TEXT,
    receiver TEXT,
    msg TEXT);
''')

In [None]:
c.execute('''
CREATE TABLE dialogs2 (
    folder INTEGER,
    dialogueID TEXT,
    date TEXT,
    sender TEXT,
    receiver TEXT,
    msg TEXT);
''')
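
Re-running a plain CREATE TABLE cell raises an error once the table exists. One possible way around that is the `IF NOT EXISTS` clause. The sketch below uses an in-memory database and a hypothetical table name `dialogs_demo`, so it does not touch the tables created above.

```python
import sqlite3

demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()

create_sql = '''
CREATE TABLE IF NOT EXISTS dialogs_demo (
    folder INTEGER,
    dialogueID TEXT,
    date TEXT,
    sender TEXT,
    receiver TEXT,
    msg TEXT);
'''

# running the statement twice is safe: the second call does nothing
demo_cur.execute(create_sql)
demo_cur.execute(create_sql)

# PRAGMA table_info returns one row per column; index 1 is the column name
columns = [row[1] for row in demo_cur.execute('PRAGMA table_info(dialogs_demo);')]
print(columns)

demo_conn.close()
```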

Now let's use the pandas method *read_csv* to read the source files and the method *to_sql* to export them to our SQL database. We also rename the columns to avoid issues with SQL words (*FROM* and *TO* are SQL keywords, and *TEXT* is a type name).

In [None]:
# load the data into a Pandas DataFrame
data1 = pd.read_csv('/kaggle/input/ubuntu-dialogue-corpus/Ubuntu-dialogue-corpus/dialogueText.csv')

# rename columns
data1.rename(columns={
    "from": "sender",
    "to": "receiver",
    "text": "msg",
}, inplace=True)

# write the data to a sqlite table
data1.to_sql('dialogs', conn, if_exists='append', index=False)
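
If a source file is too large to load comfortably in one go, *read_csv* can also return it in pieces through its `chunksize` parameter. The block below is a minimal sketch of that idea: the CSV content is a made-up three-row stand-in, the column names are assumed from the rename mapping above, and `dialogs_demo` is a hypothetical table in an in-memory database.

```python
import io
import sqlite3
import pandas as pd

# tiny made-up stand-in for a real CSV file
csv_text = io.StringIO(
    "folder,dialogueID,date,from,to,text\n"
    "3,1.tsv,2012-01-01T10:00:00.000Z,alice,,hello\n"
    "3,1.tsv,2012-01-01T10:01:00.000Z,bob,alice,hi there\n"
    "3,2.tsv,2012-01-02T09:00:00.000Z,carol,,anyone around?\n"
)

demo_conn = sqlite3.connect(':memory:')
rename_map = {"from": "sender", "to": "receiver", "text": "msg"}

# with chunksize set, read_csv yields DataFrames of at most 2 rows each
for chunk in pd.read_csv(csv_text, chunksize=2):
    chunk = chunk.rename(columns=rename_map)
    # the first chunk creates the table, later chunks append to it
    chunk.to_sql('dialogs_demo', demo_conn, if_exists='append', index=False)

row_count = demo_conn.execute('SELECT COUNT(*) FROM dialogs_demo;').fetchone()[0]
print(row_count)

demo_conn.close()
```

With real files a much larger chunk size would make sense; the value 2 is only there to force more than one chunk in this toy example.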

Now that we no longer need the pandas DataFrame *data1*, let's empty it and confirm that it is empty.

In [None]:
data1.drop(data1.index, inplace=True)
print(data1)

The next cell repeats the process for the second file, loading it into the table *dialogs2* through the DataFrame *data2*:

In [None]:
# load the data into a Pandas DataFrame
data2 = pd.read_csv('/kaggle/input/ubuntu-dialogue-corpus/Ubuntu-dialogue-corpus/dialogueText_196.csv')

# rename the problematic columns
data2.rename(columns={
    "from": "sender",
    "to": "receiver",
    "text": "msg",
}, inplace=True)

# write the data to a sqlite table
data2.to_sql('dialogs2', conn, if_exists='append', index=False)

# empty the DataFrame, since it is no longer needed
data2.drop(data2.index, inplace=True)

With all the preparations done, let's write our first SQL query. For that we use the cursor we created from the connection.

Let's look at 10 of the entries to get an idea of what our database looks like.

In [None]:
c.execute('''
SELECT * 
FROM dialogs
LIMIT 10;
''').fetchall()
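
*fetchall* returns plain tuples, so the column names are not visible in the output. A possible way to attach them is the cursor's `description` attribute. This is a minimal sketch on a made-up one-row table (`dialogs_demo` is a hypothetical name), not the real data.

```python
import sqlite3

demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()
demo_cur.execute(
    'CREATE TABLE dialogs_demo (folder INTEGER, dialogueID TEXT, sender TEXT, msg TEXT);'
)
demo_cur.execute("INSERT INTO dialogs_demo VALUES (3, '1.tsv', 'alice', 'hello');")

demo_cur.execute('SELECT * FROM dialogs_demo LIMIT 10;')

# description holds one tuple per result column; item 0 is the column name
column_names = [col[0] for col in demo_cur.description]

# pair each value with its column name
rows = [dict(zip(column_names, row)) for row in demo_cur.fetchall()]
print(rows)

demo_conn.close()
```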

Let's check the number of messages and the number of distinct folders and dialog IDs:

In [None]:
#print the quantity of messages

print('Number of messages = {}'.format(
c.execute('''
SELECT COUNT (msg) 
FROM dialogs;
''').fetchall()[0][0]
))

#print the quantity of folders

print('Number of folders = {}'.format(
c.execute('''
SELECT COUNT (DISTINCT folder) 
FROM dialogs;
''').fetchall()[0][0]
))

#print the quantity of dialog IDs

print('Number of dialog IDs = {}'.format(
c.execute('''
SELECT COUNT (DISTINCT dialogueID) 
FROM dialogs;
''').fetchall()[0][0]
))
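
The three measurements can also be taken in a single pass over the table. Here is a minimal sketch on made-up rows, assuming a hypothetical table `dialogs_demo`. Note that `COUNT(msg)` counts every non-NULL message, repeated or not, while `COUNT(DISTINCT ...)` counts each value once.

```python
import sqlite3

demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()
demo_cur.execute('CREATE TABLE dialogs_demo (folder INTEGER, dialogueID TEXT, msg TEXT);')
demo_cur.executemany('INSERT INTO dialogs_demo VALUES (?, ?, ?);', [
    (3, '1.tsv', 'hello'),
    (3, '1.tsv', 'hi there'),
    (4, '2.tsv', 'anyone around?'),
    (4, '2.tsv', None),  # stored as NULL, so COUNT(msg) skips it
])

# one query returns all three counts as a single row
n_msgs, n_folders, n_dialogs = demo_cur.execute('''
SELECT COUNT(msg), COUNT(DISTINCT folder), COUNT(DISTINCT dialogueID)
FROM dialogs_demo;
''').fetchone()

print('Number of messages = {}'.format(n_msgs))
print('Number of folders = {}'.format(n_folders))
print('Number of dialog IDs = {}'.format(n_dialogs))

demo_conn.close()
```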

In the next exploration we pick one user, *ActionParsnip1*, and count the messages they sent to each receiver other than *None* in each of the dialogs where they appear.

In [None]:
c.execute('''
SELECT sender, receiver, dialogueID, COUNT(msg)
FROM dialogs
WHERE receiver <> 'None' AND sender = 'ActionParsnip1'
GROUP BY dialogueID, sender, receiver
ORDER BY receiver ASC, COUNT(msg) DESC;
''').fetchall()
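
To repeat this exploration for other users without editing the SQL text, the user name can be passed as a query parameter through the `?` placeholder of sqlite3. The sketch below runs on made-up rows; the helper `messages_per_receiver` and the table `dialogs_demo` are hypothetical names introduced only for illustration.

```python
import sqlite3

demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()
demo_cur.execute(
    'CREATE TABLE dialogs_demo (dialogueID TEXT, sender TEXT, receiver TEXT, msg TEXT);'
)
demo_cur.executemany('INSERT INTO dialogs_demo VALUES (?, ?, ?, ?);', [
    ('1.tsv', 'alice', 'bob', 'hello'),
    ('1.tsv', 'alice', 'bob', 'still there?'),
    ('1.tsv', 'alice', None, 'anyone?'),
    ('2.tsv', 'alice', 'carol', 'hi'),
    ('2.tsv', 'bob', 'alice', 'hey'),
])

def messages_per_receiver(cursor, user):
    """Count the messages `user` sent to each receiver in each dialog."""
    # comparing NULL with <> yields NULL, so rows with a NULL receiver drop out too
    return cursor.execute('''
    SELECT sender, receiver, dialogueID, COUNT(msg)
    FROM dialogs_demo
    WHERE receiver <> 'None' AND sender = ?
    GROUP BY dialogueID, sender, receiver
    ORDER BY receiver ASC, COUNT(msg) DESC;
    ''', (user,)).fetchall()

result = messages_per_receiver(demo_cur, 'alice')
print(result)

demo_conn.close()
```

Passing the value as a parameter also avoids quoting problems when a user name contains an apostrophe.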

Let's repeat the same exploration with the table *dialogs2*.

In [None]:
c.execute('''
SELECT * 
FROM dialogs2
LIMIT 10;
''').fetchall()

In [None]:
#print the quantity of messages

print('Number of messages = {}'.format(
c.execute('''
SELECT COUNT (msg) 
FROM dialogs2;
''').fetchall()[0][0]
))

#print the quantity of folders

print('Number of folders = {}'.format(
c.execute('''
SELECT COUNT (DISTINCT folder) 
FROM dialogs2;
''').fetchall()[0][0]
))

#print the quantity of dialog IDs

print('Number of dialog IDs = {}'.format(
c.execute('''
SELECT COUNT (DISTINCT dialogueID) 
FROM dialogs2;
''').fetchall()[0][0]
))

Interestingly enough, this second set of messages has more than one folder.

Let's check whether any dialog ID appears in more than one folder.

In [None]:
c.execute('''
SELECT COUNT (DISTINCT folder), dialogueID 
FROM dialogs2
GROUP BY dialogueID
HAVING COUNT (DISTINCT folder) > 1
ORDER BY COUNT (DISTINCT folder) DESC;

''').fetchall()

In the next code window we will merge both of the tables *dialogs* and *dialogs2* into a third table *alldialogs*.

We use a UNION for this merge because it discards duplicate rows, so repeated values do not reach the final result (they would if we ran an INSERT INTO for each table in turn). So we take the UNION of both tables and INSERT the result INTO an empty table.

For a detailed explanation of the different methods, see the page below:

[https://www.sqlitetutorial.net/sqlite-union/](https://www.sqlitetutorial.net/sqlite-union/)
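
The difference is easy to see on a toy example. The sketch below builds two made-up tables (`part_a` and `part_b` are hypothetical names) that share one identical row, then compares UNION with UNION ALL, which keeps every row just as two separate INSERT INTO statements would.

```python
import sqlite3

demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()
demo_cur.execute('CREATE TABLE part_a (dialogueID TEXT, msg TEXT);')
demo_cur.execute('CREATE TABLE part_b (dialogueID TEXT, msg TEXT);')
demo_cur.executemany('INSERT INTO part_a VALUES (?, ?);',
                     [('1.tsv', 'hello'), ('2.tsv', 'hi')])
demo_cur.executemany('INSERT INTO part_b VALUES (?, ?);',
                     [('2.tsv', 'hi'), ('3.tsv', 'bye')])

# UNION removes rows that are identical in every selected column
union_rows = demo_cur.execute('''
SELECT dialogueID, msg FROM part_a
UNION
SELECT dialogueID, msg FROM part_b;
''').fetchall()

# UNION ALL keeps all rows, including the repeated one
union_all_rows = demo_cur.execute('''
SELECT dialogueID, msg FROM part_a
UNION ALL
SELECT dialogueID, msg FROM part_b;
''').fetchall()

print(len(union_rows), len(union_all_rows))

demo_conn.close()
```

Keep in mind that UNION also collapses identical rows that come from the same table, not only rows repeated across the two tables.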

In [None]:
c.execute('''CREATE TABLE alldialogs (
    folder INTEGER,
    dialogueID TEXT,
    date TEXT,
    sender TEXT,
    receiver TEXT,
    msg TEXT);''')

c.execute('''
INSERT INTO alldialogs
SELECT *
FROM (
    SELECT folder, dialogueID, date, sender, receiver, msg
    FROM dialogs2
    UNION
    SELECT folder, dialogueID, date, sender, receiver, msg
    FROM dialogs
);
''')

# save the inserted rows to the database file
conn.commit()

Let's take the same measurements on the resulting table:

In [None]:
#print the quantity of messages

print('Number of messages = {}'.format(
c.execute('''
SELECT COUNT (msg) 
FROM alldialogs;
''').fetchall()[0][0]
))

#print the quantity of folders

print('Number of folders = {}'.format(
c.execute('''
SELECT COUNT (DISTINCT folder) 
FROM alldialogs;
''').fetchall()[0][0]
))

#print the quantity of dialog IDs

print('Number of dialog IDs = {}'.format(
c.execute('''
SELECT COUNT (DISTINCT dialogueID) 
FROM alldialogs;
''').fetchall()[0][0]
))

But don't take my word for it! Let's use the INSERT INTO method and satisfy your Data Scientist curiosity.

We will create a dialogdummy table, make the measurements, and compare the results:

In [None]:
# copy the first table into a new one
c.execute('''
CREATE TABLE dialogdummy AS SELECT * FROM dialogs;
''')

# append every row of the second table, duplicates included
c.execute('''
INSERT INTO dialogdummy
SELECT *
FROM dialogs2;
''')

In [None]:
#print the quantity of messages

print('Number of messages = {}'.format(
c.execute('''
SELECT COUNT (msg) 
FROM dialogdummy;
''').fetchall()[0][0]
))

#print the quantity of folders

print('Number of folders = {}'.format(
c.execute('''
SELECT COUNT (DISTINCT folder) 
FROM dialogdummy;
''').fetchall()[0][0]
))

#print the quantity of dialog IDs

print('Number of dialog IDs = {}'.format(
c.execute('''
SELECT COUNT (DISTINCT dialogueID) 
FROM dialogdummy;
''').fetchall()[0][0]
))

You can see that the number of messages for the *dialogdummy* table is greater than the number for *alldialogs* (10250300 > 9187170).

In [None]:
#print the total quantity of messages in dialogdummy

print('Total number of messages in dialogdummy = {}'.format(
c.execute('''
SELECT COUNT (msg) 
FROM dialogdummy;
''').fetchall()[0][0]
))

#print the total quantity of messages in alldialogs

print('Total number of messages in alldialogs = {}'.format(
c.execute('''
SELECT COUNT (msg) 
FROM alldialogs;
''').fetchall()[0][0]
))

#print the quantity of distinct messages in dialogdummy

print('Number of distinct messages in dialogdummy = {}'.format(
c.execute('''
SELECT COUNT (DISTINCT msg) 
FROM dialogdummy;
''').fetchall()[0][0]
))

#print the quantity of distinct messages in alldialogs

print('Number of distinct messages in alldialogs = {}'.format(
c.execute('''
SELECT COUNT (DISTINCT msg) 
FROM alldialogs;
''').fetchall()[0][0]
))

Even though the total number of messages is higher for the *dialogdummy* table (created by a plain INSERT INTO), the number of distinct messages is the same in both tables, which suggests that the extra rows carried over by INSERT INTO are repeated content.
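
Equal distinct counts are only indirect evidence, since the same text can legitimately be sent by different users. A more direct check is to group by all the columns and keep the groups that occur more than once. This is a minimal sketch on made-up rows, with a hypothetical table `dummy_demo` that has fewer columns than the real one.

```python
import sqlite3

demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()
demo_cur.execute('CREATE TABLE dummy_demo (dialogueID TEXT, sender TEXT, msg TEXT);')
demo_cur.executemany('INSERT INTO dummy_demo VALUES (?, ?, ?);', [
    ('1.tsv', 'alice', 'hello'),
    ('1.tsv', 'alice', 'hello'),  # exact copy of the row above
    ('2.tsv', 'bob', 'hello'),    # same text, but a different row
    ('3.tsv', 'carol', 'bye'),
])

# rows that are identical in every column end up in the same group
duplicates = demo_cur.execute('''
SELECT dialogueID, sender, msg, COUNT(*) AS copies
FROM dummy_demo
GROUP BY dialogueID, sender, msg
HAVING COUNT(*) > 1;
''').fetchall()
print(duplicates)

demo_conn.close()
```

Only the exact copy is reported; bob's "hello" shares its text with alice's but is not a duplicated row.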

Let's delete the *dialogdummy* table and move on.

In [None]:
c.execute('''
DROP TABLE dialogdummy
''')

If you want to get rid of the first two tables, uncomment the code below and run it.

In [None]:
#c.execute('''
#DROP TABLE dialogs
#''')

#c.execute('''
#DROP TABLE dialogs2
#''')

Suppose you have run all the queries you wanted: merging and concatenating tables, grouping columns, and so on. Now you want to export the resulting table back to a CSV file. You can do it with the code in the cell below:

In [None]:
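# read the merged table into a DataFrame, sorted by dialog ID
alldialogs_df = pd.read_sql('''
SELECT *
FROM alldialogs
ORDER BY dialogueID ASC;
''', conn)

# write the DataFrame to a CSV file, without the index column
alldialogs_df.to_csv('/kaggle/working/results.csv', index=False)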
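
For a table with millions of rows, building one big DataFrame just to write it out again can use a lot of memory. As an alternative to the pandas route above, the rows can be streamed from the cursor straight into Python's built-in csv writer. The block below is a minimal sketch of that swapped-in technique, using made-up rows, a hypothetical table `alldialogs_demo`, and an in-memory text buffer in place of a real output file.

```python
import csv
import io
import sqlite3

demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()
demo_cur.execute('CREATE TABLE alldialogs_demo (dialogueID TEXT, msg TEXT);')
demo_cur.executemany('INSERT INTO alldialogs_demo VALUES (?, ?);',
                     [('2.tsv', 'hi'), ('1.tsv', 'hello')])

demo_cur.execute('SELECT * FROM alldialogs_demo ORDER BY dialogueID ASC;')
header = [col[0] for col in demo_cur.description]

# stand-in for open('results.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(header)

# iterating over the cursor fetches rows as they are needed
for row in demo_cur:
    writer.writerow(row)

csv_output = buffer.getvalue()
print(csv_output)

demo_conn.close()
```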