**Analyzing Chat Log Data Sets For Digital Forensics Using Data Science**

---



# Background Information

Digital Forensics is the science of identifying, gathering, processing, analyzing, and 
reporting on data stored electronically. Emails, text messages, images, documents, and internet histories 
are all important pieces of information that can be extracted from electronic devices.
 
An analyst has to examine these files from a user or suspect while trying to preserve the 
evidence in its most original form. Any alarming piece of information in these files can be used 
as evidence. 

Usually, the volume of electronic data and files is huge, and going through every file or text message 
manually takes a great deal of time. This is where Data Science can be very beneficial to Digital Forensics.

Often there are situations where you may have to view a user's chat logs. You may have to read through 
their messages to see their conversations, who they're communicating with, potential partners, or other key 
pieces of evidence. 




# What We Will Be Doing

We will give you some tips on how to condense and organize chat logs into a smaller data set 
that can be analyzed more easily. We will explain how to organize the data in a better fashion, 
remove duplicates, clean bad or useless data, and more.

For this example we will be viewing public Discord servers for four programming language communities: Python, Go, Clojure, and Racket. These servers are public and serve as a resource for getting 
technical help, sharing knowledge, and holding real-time conversations between fellow community members.

# Learning Goals

Learning Goals for this tutorial:

*   Use Pandas to manipulate a data set
*   Clean the data
*   Analyze and sort through the data set
*   Pull specific details from the data set







# Getting Started

To get started, we will import some libraries that will help us process and go through data sets.
We will be using the 'pandas' library, an open source data analysis and manipulation tool built on top of the 
Python programming language.

We will also use the 'numpy' library to help us use arrays to better organize our data.

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

Our first line of code reads in our data set using the pandas library we imported. 
Here we are using one month of chat logs to start with.

We are going to store our data set of chat logs in a variable called "df" and use the 'pandas' library to read the XML file.

In [None]:
df = pd.read_xml('pythongeneralDec2019.xml')
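To see how "read_xml" turns XML elements into rows and columns, a tiny in-memory document is enough. The sketch below is only an assumption about the file layout: the tag names `log` and `message` are made up for illustration, and the real export may be structured differently. It passes `parser="etree"` so that only Python's built-in XML parser is needed.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature chat log; the tag names <log> and <message> are assumptions.
sample_xml = """<log>
  <message><ts>2019-12-01T00:56:23.288000</ts><user>Ryden</user><text>hello</text></message>
  <message><ts>2019-12-01T01:07:04.875000</ts><user>Jayvien</user><text>say what?</text></message>
</log>"""

# Each <message> element becomes one row; its child tags become the columns.
sample_df = pd.read_xml(StringIO(sample_xml), xpath=".//message", parser="etree")
print(sample_df)
```

If the real file does wrap each message in its own element, selecting only those elements with `xpath` would load the messages without the extra metadata rows.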

Let's take a look at what our data consists of using the "info" method. 

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30944 entries, 0 to 30943
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   team_domain   1 non-null      object
 1   channel_name  1 non-null      object
 2   start_date    1 non-null      object
 3   end_date      1 non-null      object
 4   ts            30940 non-null  object
 5   user          30940 non-null  object
 6   text          30940 non-null  object
dtypes: object(7)
memory usage: 1.7+ MB


Here we can see a more general view of what our data set consists of. This shows our column names as well as how many non-null values are in each column. Notice that each of the first four columns contains only one piece of data.
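Another way to confirm which columns are mostly empty is to count the missing values directly. The sketch below uses a small hand-made frame (`toy` is a made-up stand-in, not the chat log) that imitates the pattern in the output above.

```python
import pandas as pd

# Toy frame imitating the layout above: one metadata column filled in a
# single row, and message columns filled in every other row.
toy = pd.DataFrame({
    "team_domain": ["Python", None, None],
    "user": [None, "Ryden", "Jayvien"],
    "text": [None, "hello", "say what?"],
})

# isna() marks missing cells; summing the booleans counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```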

By evaluating just "df" on its own, we can see the actual contents of the data set as a whole.


In [None]:
df

Unnamed: 0,team_domain,channel_name,start_date,end_date,ts,user,text
0,Python,,,,,,
1,,python-general,,,,,
2,,,2019-12-01T00:56:23.288000,,,,
3,,,,2019-12-31T18:29:57.380000,,,
4,,,,,2019-12-01T00:56:23.288000,Ryden,where is the name of the file stored on the file?
...,...,...,...,...,...,...,...
30939,,,,,2019-12-31T18:24:58.198000,Azyriah,hi @GokturkSM
30940,,,,,2019-12-31T18:25:37.421000,Areesha,Anyone have some math background? Trying to im...
30941,,,,,2019-12-31T18:26:17.603000,Areesha,My question is... what exactly is a cyclic gro...
30942,,,,,2019-12-31T18:29:15.676000,Ailany,You need to know what exceptions your code may...


# Data Cleaning

Notice the first four columns again: we only have one piece of information in each of them, and they aren't that useful to us. We do not need to work with the name of the Discord server or channel, or with the start and end date/time. 

We will get rid of these four columns and the four rows that hold their values, starting with the rows, to simplify and condense our data set.

In [None]:
df = df.drop(range(0,4))

We used the "drop" method to get rid of our first four rows of data.
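Dropping by position works here because the metadata rows happen to come first. As an alternative sketch, assuming every real message has a timestamp, "dropna" with the `subset` argument removes any row that lacks one, wherever it sits. The `toy` frame is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "team_domain": ["Python", None, None],
    "ts": [None, "2019-12-01T00:56:23.288000", "2019-12-01T01:07:04.875000"],
    "user": [None, "Ryden", "Jayvien"],
})

# Keep only the rows that have a value in the 'ts' column.
messages_only = toy.dropna(subset=["ts"])
print(messages_only)
```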

Let's take a look at what it looks like now.

In [None]:
df

Unnamed: 0,team_domain,channel_name,start_date,end_date,ts,user,text
4,,,,,2019-12-01T00:56:23.288000,Ryden,where is the name of the file stored on the file?
5,,,,,2019-12-01T01:07:04.875000,Jayvien,say what?
6,,,,,2019-12-01T01:10:54.022000,Kaliope,Where would you guys recommend I learn kotlin ...
7,,,,,2019-12-01T01:13:29.732000,Naely,I used the official Kotlin docs they have an e...
8,,,,,2019-12-01T01:16:43.656000,Jayvien,do they?
...,...,...,...,...,...,...,...
30939,,,,,2019-12-31T18:24:58.198000,Azyriah,hi @GokturkSM
30940,,,,,2019-12-31T18:25:37.421000,Areesha,Anyone have some math background? Trying to im...
30941,,,,,2019-12-31T18:26:17.603000,Areesha,My question is... what exactly is a cyclic gro...
30942,,,,,2019-12-31T18:29:15.676000,Ailany,You need to know what exceptions your code may...


Our data set looks a little cleaner now. However, the first four columns are now full of empty or null data.

Let's now get rid of these four columns to condense our table into a more readable form.

We will use the same "drop" method, but this time we will specify exactly which column we want to drop and use "axis=1" to indicate that we want to drop a column instead of a row.

In [None]:
df = df.drop(labels="team_domain", axis=1)

In [None]:
df

Unnamed: 0,channel_name,start_date,end_date,ts,user,text
4,,,,2019-12-01T00:56:23.288000,Ryden,where is the name of the file stored on the file?
5,,,,2019-12-01T01:07:04.875000,Jayvien,say what?
6,,,,2019-12-01T01:10:54.022000,Kaliope,Where would you guys recommend I learn kotlin ...
7,,,,2019-12-01T01:13:29.732000,Naely,I used the official Kotlin docs they have an e...
8,,,,2019-12-01T01:16:43.656000,Jayvien,do they?
...,...,...,...,...,...,...
30939,,,,2019-12-31T18:24:58.198000,Azyriah,hi @GokturkSM
30940,,,,2019-12-31T18:25:37.421000,Areesha,Anyone have some math background? Trying to im...
30941,,,,2019-12-31T18:26:17.603000,Areesha,My question is... what exactly is a cyclic gro...
30942,,,,2019-12-31T18:29:15.676000,Ailany,You need to know what exceptions your code may...


As you can see we dropped the "team_domain" column successfully. 

Let's get rid of the remaining three empty columns, then take a look at how our data set table looks.

In [None]:
# Passing a list of labels drops all three columns in one call
df = df.drop(labels=["channel_name", "start_date", "end_date"], axis=1)

In [None]:
df

Unnamed: 0,ts,user,text
4,2019-12-01T00:56:23.288000,Ryden,where is the name of the file stored on the file?
5,2019-12-01T01:07:04.875000,Jayvien,say what?
6,2019-12-01T01:10:54.022000,Kaliope,Where would you guys recommend I learn kotlin ...
7,2019-12-01T01:13:29.732000,Naely,I used the official Kotlin docs they have an e...
8,2019-12-01T01:16:43.656000,Jayvien,do they?
...,...,...,...
30939,2019-12-31T18:24:58.198000,Azyriah,hi @GokturkSM
30940,2019-12-31T18:25:37.421000,Areesha,Anyone have some math background? Trying to im...
30941,2019-12-31T18:26:17.603000,Areesha,My question is... what exactly is a cyclic gro...
30942,2019-12-31T18:29:15.676000,Ailany,You need to know what exceptions your code may...
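When a data set has many empty columns, naming each one can get tedious. As a hedged alternative, "dropna" with `axis=1` and `how="all"` removes every column that contains no values at all. The `toy` frame below is a made-up example, not the chat log.

```python
import pandas as pd

toy = pd.DataFrame({
    "channel_name": [None, None],
    "start_date": [None, None],
    "ts": ["2019-12-01T00:56:23.288000", "2019-12-01T01:07:04.875000"],
    "user": ["Ryden", "Jayvien"],
})

# axis=1 targets columns; how="all" drops a column only if every value is missing.
trimmed = toy.dropna(axis=1, how="all")
print(trimmed.columns.tolist())
```

Note that this only works once the rows holding the single metadata values are gone, since a column with even one value is kept.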


# Formatting

Our table looks much cleaner now that we have consolidated it down to the data we actually want to look at.

Let's next rename the 'ts' column to 'time' to make its meaning clearer. We will use the "rename" method to do so.

In [None]:
df.rename(columns = {'ts':'time'}, inplace = True)

In [None]:
df

Unnamed: 0,time,user,text
4,2019-12-01T00:56:23.288000,Ryden,where is the name of the file stored on the file?
5,2019-12-01T01:07:04.875000,Jayvien,say what?
6,2019-12-01T01:10:54.022000,Kaliope,Where would you guys recommend I learn kotlin ...
7,2019-12-01T01:13:29.732000,Naely,I used the official Kotlin docs they have an e...
8,2019-12-01T01:16:43.656000,Jayvien,do they?
...,...,...,...
30939,2019-12-31T18:24:58.198000,Azyriah,hi @GokturkSM
30940,2019-12-31T18:25:37.421000,Areesha,Anyone have some math background? Trying to im...
30941,2019-12-31T18:26:17.603000,Areesha,My question is... what exactly is a cyclic gro...
30942,2019-12-31T18:29:15.676000,Ailany,You need to know what exceptions your code may...


Our table is now simpler to look at. However, the format of the time column makes it hard to read.

Let's now re-format this column so we can read the time each message was sent.

We will use the 'pandas' library once again, along with the "to_datetime" function, to convert this column into a more readable form.

In [None]:
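# The format string has to describe the whole timestamp, including the "T" separator and the fractional seconds
df['time'] = pd.to_datetime(df['time'], format="%Y-%m-%dT%H:%M:%S.%f")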

In [None]:
df

Unnamed: 0,time,user,text
4,2019-12-01 00:56:23.288,Ryden,where is the name of the file stored on the file?
5,2019-12-01 01:07:04.875,Jayvien,say what?
6,2019-12-01 01:10:54.022,Kaliope,Where would you guys recommend I learn kotlin ...
7,2019-12-01 01:13:29.732,Naely,I used the official Kotlin docs they have an e...
8,2019-12-01 01:16:43.656,Jayvien,do they?
...,...,...,...
30939,2019-12-31 18:24:58.198,Azyriah,hi @GokturkSM
30940,2019-12-31 18:25:37.421,Areesha,Anyone have some math background? Trying to im...
30941,2019-12-31 18:26:17.603,Areesha,My question is... what exactly is a cyclic gro...
30942,2019-12-31 18:29:15.676,Ailany,You need to know what exceptions your code may...


Now the date and time each message was sent are much easier to read.

We can start analyzing some of these messages.
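Once the column holds real timestamps, we can filter and group by time. The sketch below is a small illustration on invented data (`toy`, `start`, and `end` are made-up names): it selects the messages sent inside a one-hour window and counts the messages per calendar day.

```python
import pandas as pd

toy = pd.DataFrame({
    "time": pd.to_datetime([
        "2019-12-01 00:56:23.288",
        "2019-12-01 01:07:04.875",
        "2019-12-02 09:15:00.000",
    ]),
    "user": ["Ryden", "Jayvien", "Naely"],
})

# Select messages sent during the first hour of 1 December 2019.
start = pd.Timestamp("2019-12-01 00:00:00")
end = start + pd.Timedelta(hours=1)
in_window = toy[(toy["time"] >= start) & (toy["time"] < end)]

# Count messages per calendar day, ordered by date.
per_day = toy["time"].dt.date.value_counts().sort_index()
print(in_window)
print(per_day)
```

In an investigation, a window like this could be placed around a known event to see who was active at that time.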

# Analysis

Since our data frame is now more organized and cleaned up, we can start analyzing some of the data stored.

Let's take a look at how much data there is to go through in our columns by using the "count" method.

In [None]:
df.count()

time    30940
user    30940
text    30940
dtype: int64

The result shows that we have 30,940 different text messages in this data set, which is a lot to manually look through.
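Before going further with a log of this size, it can help to check for repeated entries. The sketch below assumes that a duplicate means a row whose time, user, and text are all identical to an earlier row; the `toy` frame is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "time": ["2019-12-01 00:56:23", "2019-12-01 00:56:23", "2019-12-01 01:07:04"],
    "user": ["Ryden", "Ryden", "Jayvien"],
    "text": ["hello", "hello", "say what?"],
})

# duplicated() flags every repeat of an earlier, identical row.
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates()
print(n_duplicates, len(deduped))
```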

We can analyze this information in a much easier way by using code.

Let's first take a look at the "user" column. 

We can count how many messages each user has sent by incorporating the "groupby" method along with the "count" method.

In [None]:
df.groupby(['user'])['user'].count()

user
Aashvi         17
Abbott         37
Abdalahe        6
Abdelfetah      5
Abdelmadjid    76
               ..
Zuma            1
Zurisadai       4
Zuriya         25
Zviad           1
Zyanna          5
Name: user, Length: 828, dtype: int64

With this table, we can see each user and how many messages they have sent in the Discord server.
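The same counts can be produced in more than one way. As a small sketch on invented data, the example below compares "groupby" with "size" against "value_counts": both give one count per user, but they order the result differently.

```python
import pandas as pd

toy = pd.DataFrame({"user": ["Ryden", "Naely", "Ryden", "Kosta", "Ryden", "Naely"]})

# groupby + size: one row per user, ordered alphabetically by user name.
by_group = toy.groupby("user").size()

# value_counts: the same counts, ordered from most to least frequent.
by_counts = toy["user"].value_counts()
print(by_group)
print(by_counts)
```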

Let's start analyzing the table and find which user has sent the most messages in total.

We can add on the "max" method to our code and it will return the maximum value of messages sent for one user.

In [None]:
df.groupby(['user'])['user'].count().max()

2626

So we can see that the most active user has sent 2,626 messages in this data set.

Let's find out who sent these messages.

We will use the "value_counts" method, which counts how many times each user appears in the 'user' column, along with the "idxmax" method, which returns the index label (here, the user name) with the highest count.

In [None]:
df['user'].value_counts().idxmax()

'Xochilt'

This shows us the user who sent the most messages in the data set.

Let's also look at the other top users who have sent the most messages. 

We will use the "head" method, which returns the first five rows unless specified otherwise. Because "value_counts" sorts from highest to lowest, these are the five most active users.

In [None]:
df['user'].value_counts().head()

Xochilt     2626
Kosta       1280
Naely       1067
Adirah       845
Andersyn     662
Name: user, dtype: int64

This table now displays the five users who messaged the most in our data set.

If we have someone we want to analyze more specifically, we can also specify which user we want to take a look at. We can then find how many messages they sent, and more.

Let's find out how many messages "Ryden" sent, for example.

In [None]:
df['user'].value_counts()['Ryden']

113
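Beyond counting, we will often want to read what a particular user actually wrote. The sketch below, on invented data, uses a boolean mask to keep only one user's rows and sorts them by time.

```python
import pandas as pd

toy = pd.DataFrame({
    "time": pd.to_datetime([
        "2019-12-01 01:07:04",
        "2019-12-01 00:56:23",
        "2019-12-01 01:10:54",
    ]),
    "user": ["Jayvien", "Ryden", "Ryden"],
    "text": ["say what?", "hello", "anyone here?"],
})

# The comparison builds a True/False mask; indexing with it keeps the True rows.
ryden_messages = toy[toy["user"] == "Ryden"].sort_values("time")
print(ryden_messages[["time", "text"]])
```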

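Next, let's try to find which user asked the most questions. As a first step, we can count how many messages consist of only a question mark.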

In [None]:
df['text'].value_counts()['?']

60

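In [None]:
# Indexing the grouped counts with ['?'] looks for a user named "?" and raises a KeyError,
# so filter the rows whose text is exactly "?" first, then count them per user
df.loc[df['text'] == '?', 'user'].value_counts()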
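Most questions are longer than a single "?", so an exact match misses them. As a rough sketch, assuming that any message containing a question mark counts as a question, "str.contains" can flag those messages before counting per user. The `toy` frame is invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "user": ["Ryden", "Jayvien", "Ryden", "Naely"],
    "text": ["where is the file name stored?", "say what?", "do they?", "thanks"],
})

# regex=False treats "?" as a literal character rather than a regex operator;
# na=False treats missing text as "not a question".
is_question = toy["text"].str.contains("?", regex=False, na=False)

# Count flagged messages per user, then pick the user with the most.
questions_per_user = toy.loc[is_question, "user"].value_counts()
top_asker = questions_per_user.idxmax()
print(questions_per_user)
print(top_asker)
```

This heuristic will also count messages that merely include a question mark, such as URLs with query strings, so the result is a starting point rather than a precise measure.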

# Challenges

Add activities for users to learn interactively

Add a sample data set and have users do simple tasks

In [None]:
score = 0
questions = 5

In [None]:
print("Find how many messages 'Kaliope' sent.")

Find how many messages 'Kaliope' sent.


In [None]:
answer1 = input()

df['user'].value_counts()['Kaliope']


In [None]:
if answer1 == "df['user'].value_counts()['Kaliope']":
  print("Correct!")
  score += 1
else:
  print("Incorrect")
  print("One way to solve this is:")
  print("")
  print("df['user'].value_counts()['Kaliope']")

Correct!
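Comparing the typed code against one exact string rejects answers that are equivalent but written differently, for example with double quotes or extra spaces. One possible alternative, sketched below on invented data, is to ask for the resulting number and grade that instead. The helper name `check_answer` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"user": ["Kaliope", "Ryden", "Kaliope"]})


def check_answer(submitted: str, expected: int) -> bool:
    """Return True if the submitted text is the expected whole number."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        # Non-numeric input counts as a wrong answer.
        return False


# Compute the expected value from the data rather than hard-coding it.
expected_count = int(toy["user"].value_counts()["Kaliope"])
print(check_answer(" 2 ", expected_count))
print(check_answer("two", expected_count))
```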


In [None]:
score = round(score/questions*100)
print('Quiz completed.')
print('Your final score is {}%.'.format(score))

Quiz completed.
Your final score is 20%.


In [None]:
if int(score) >= 70:
    print("Congrats, you passed!")
else:
    print("Sorry, you failed.")

Sorry, you failed.
