Notebook extract telephone numbers and contact names through patterns in metadatafiles.
Pattern "name" introduces a contact name. Pattern "pushname" introduces a whatsapp id name.
Through the country code of the telephone number the country of the contact is found.
In a last step, the list is sorted and duplicates of numbers are deleted. During this step the occurences of each number is counted and appended to the csv file.

In [10]:
'''
Move metadata .ldb files to specified location and change file extension to allow reading it with python reader object

src_path:   Path where indexeddb data is stored (e.g. .ldb metadata files)
des_pah:    Path where the metadata files are stored as textfiles
'''

import os
import shutil

src_path = 'C:\\Users\\TillK\\AppData\\Local\\Google\\Chrome\\User Data\\Default\\IndexedDB\\https_web.whatsapp.com_0.indexeddb.leveldb' #C:\\Users\\TillK\\prjcts\\Forensics\\https_web.whatsapp.com_0.indexeddb.leveldb_00'
des_path = './input/'

# check if folder exists, otherwise create
if not os.path.isdir(des_path):
    os.mkdir(des_path)
    
for filename in os.listdir(src_path):
    if filename.endswith('.ldb'):  # change the extension to txt file for further processing
        new_filename = os.path.splitext(filename)[0] + '.txt'
        old_path = os.path.join(src_path, filename)
        new_path = os.path.join(des_path, new_filename)
        shutil.copyfile(old_path, new_path)
        print(f"Copied file {filename} and converted to textfile.")

# Try except, when file is open e.g. in whatsapp

Copied file 000031.ldb and converted to textfile.
Copied file 000032.ldb and converted to textfile.
Copied file 000034.ldb and converted to textfile.
Copied file 000035.ldb and converted to textfile.
Copied file 000036.ldb and converted to textfile.


In [34]:
'''
Pre-defined patterns are extracted from metadata files and stored in equivalent named files in specified ouput directory.

in_dir:     Path were copied textfiles are located. Normally equivalent to 'des_path' from previous cell.
out_dir:    Path were extracted patterns are stored into equivalent named files.
patterns:   Dictionary for patterns to be extracted. Keys specify the filenames of extracted ouput of the relevent pattern.

extract_patterns(input_file, output_file, pattern):   Method to extract 'pattern' from 'input_file' to store results in 'output_file'

'''
import re
import os

out_dir = './output/'
in_dir = './input/'

no_names = r'Status|Error|Event|Single|Direct|Time|Sent|Update|Obaque|Record|Paused|Dialog|Select|Offset|Sender|Parent|Height|Hverified|Lverified|Connec'

patterns = {
    "name_double_plus" : r'[A-Z][a-z]{2,10}(\x20[A-Z][a-z]{2,10})+',
    "name_single" : r'(?<=[\x01\x14])(?!' + no_names + r')[[A-Z][a-z]{3,10}',
    "number_simple" : r'[0-9]{9,15}', # decrease false positives further by making length dependent on country code e.g. russian numbers are short
    "number_and_name_linked" : r'\x02.{0,100}@(?:s\.whatsapp\.net|c\.us).{0,100}\x02'
}

# check if folder exists, otherwise create
if not os.path.isdir(out_dir):
    os.mkdir(out_dir)

def extract_patterns(input_file, output_file, pattern, replace00 = True):
    with open(input_file, 'r', encoding='ISO-8859-1') as input_f:
        with open(output_file, 'a', encoding='ISO-8859-1') as output_f:
            # iterate over findings and write them to output_file
            for line in input_f:
                # delete 0x00 bytes from file 
                if replace00:
                    line = line.replace('\x00', '')
                itermatch = re.finditer(pattern, line)
                eof = False
                while not eof:
                    curmatch = ""
                    try:
                        curmatch = next(itermatch)
                        output_f.write(curmatch.group(0) + '\n')
                    except:
                        eof = True
                    
                    

# get all txt files in 'input_path' and extract patterns

# delete existing output_files
for f in os.listdir(out_dir):
    if not f.endswith(".txt"):
        continue
    os.remove(os.path.join(out_dir, f))

metadata_files = os.listdir(in_dir)

for filename in metadata_files:
    if filename.endswith('.txt'):
        input_file = os.path.join(in_dir, filename)
        for pattern in patterns:
            output_file = os.path.join(out_dir, pattern + ".txt")
            extract_patterns(input_file, output_file, patterns[pattern])


print(f'Pattern extraction complete! Output from {len(patterns)} patterns of {len(metadata_files)} files.')


Pattern extraction complete! Output from 4 patterns of 5 files.


In [38]:
'''
Further evaluation of 'number_and_name_linked' pattern. The previous extracted pattern extracted only whole lines which are likely to contain both numbers and names which correspond to each other. 
Thus, this cells extracts both instances and put them together. 
Furthermore, a method to add country codes with respect to the numbers is added

number_pattern:     pattern for numbers in the pre-extracted lines.
name_pattern:       pattern for names in the pre-extracted lines.
no_contact_pattern: pattern which indicates that contact is not saved in contact list of user.

preextracted_pattern_file:  This filepath should correspond to the file the extracted pattern is saved to in the previous cell.
csv_file_linked:   Filepath the output from putting together numbers and names is saved to.
csv_file_linked_with_county_codes:  Filepath the output with country_codes is saved to.

'''


import re
import csv

# Define the pattern for matching numbers and names
number_pattern = r"\d{10,16}@(?:s\.whatsapp\.net|c\.us)"
name_pattern = r"(?<=name..)[A-Z][a-z]+(\s([A-Z][a-z]+)*)?"
no_contact_pattern = r"pushname"

preextracted_pattern_file = out_dir + "number_and_name_linked.txt"
csv_file_linked = out_dir + "numbers_and_contacts.csv"
csv_file_linked_with_country_codes = out_dir  + "numbers_contacts_and_countrycode.csv"



def extract_linked_number_and_names(input_file, output_file = 'number_and_contacts.csv', delete_input_file_afterwards = True, extract_non_linked_numbers = False):
    # Open the input text file for reading
    with open(input_file, "r", encoding = 'ISO-8859-1') as file:
        # Create a CSV file for writing
        with open(output_file, "w", newline="") as csvfile:
            fieldnames = ["Number", "Name", "in_contacts"]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            hits = 0
            # Process each line in the input text file
            for line in file:
                line = line.strip()  # Remove leading/trailing whitespaces
                if line:  # Skip empty lines
                    # Extract number and name from the line using pattern matching
                    number = re.search(number_pattern, line)
                    name = re.search(name_pattern, line)

                    if number and (name or extract_non_linked_numbers):
                        # Remove the "@s.whatsapp.net" part from the number
                        number = number.group(0).replace("@s.whatsapp.net", "")
                        number = number.replace("@c.us", "")
                        if name:
                            name = name.group(0)
                        # Write the extracted data to the CSV file
                        in_contacts = True
                        if re.search(no_contact_pattern, line) or not name:
                            in_contacts = False
                        writer.writerow({"Number": number, "Name": name, "in_contacts": in_contacts})
                        hits += 1
            print(f"Wrote new file {output_file} with {hits} lines.")
    # remove pre-extracted pattern file
    if delete_input_file_afterwards: 
        os.remove(input_file)
        print(f"Deleted file: {input_file}.")
    # return filepath such that functions can be nested
    return output_file

def add_country_codes_to_numbers(input_file, output_file=None, delete_input_file_afterwards = True):
    if not output_file:
        output_file = os.path.splitext(input_file)[0] + '_with_country_code.txt'
    # read the mapping of country codes to phone number prefixes
    country_codes = {}
    # check if country-code.csv file exists
    if os.path.exists('country-codes.csv'):
        with open('country-codes.csv') as f:
            reader = csv.reader(f)
            for row in reader:
                country_codes[row[1]] = row[0]
        # remove entries with empty keys or values
        country_codes = {k: v for k, v in country_codes.items() if k != '' and v.strip() != ''}
        # add norway manually
        country_codes['47'] = 'NOR'

        # open the original CSV file and add a fourth column with the country code
        with open(input_file) as f, open(output_file, 'w', newline='') as g:
            reader = csv.reader(f)
            writer = csv.writer(g)
            header = next(reader)  # read the header row
            header.append('country_code')  # extend the header row
            writer.writerow(header)  # write the updated header row
            hits = 0
            for row in reader:
                phone_number = row[0]
                country_code = None
                for prefix, code in country_codes.items():
                    if phone_number.startswith(prefix):
                        country_code = code
                        break
                row.append(country_code)
                writer.writerow(row)
                hits += 1
        print(f"Wrote new file {output_file} with {hits} lines!")
        if delete_input_file_afterwards: 
            os.remove(input_file)
            print(f"Deleted file: {input_file}.")

    else:
        print("File country-code.csv file could not be found")
        print("Try downloading it from https://github.com/datasets/country-codes/blob/master/data/country-codes.csv\nAborting...")
    # return filepath such that functions can be nested
    return output_file

def add_number_occurences(input_file, output_file=None, delete_input_file_afterwards=True):
    '''
    Sort list, then remove dublicates and increase counter
    '''
    if not output_file:
        output_file = os.path.splitext(input_file)[0] + '_and_occurences.txt'
    # Read the CSV file
    with open(input_file, newline='') as f:
        reader = csv.DictReader(f)
        data = [row for row in reader]

    # Sort the data by the "country_code" column
    sorted_data = sorted(data[1:], key=lambda x: int(x['Number']))

    # setup unique data list with new 'occurences' field
    unique_data = [sorted_data[0]]
    unique_data[0]['occ'] = 1

    for item in sorted_data:
        if item['Number'] != unique_data[-1]['Number']:
            # add to unique data list
            unique_data.append(item)
            item['occ'] = 1
        else:
            # add occurences counter
            unique_data[-1]['occ'] += 1
            # check for additional information
            if item.get('Name') and item['Name']:
                if unique_data[-1]['Name'] != item['Name'] and not item['Name']:
                    print(f"There are two different names, namely {unique_data[-1]['Name']} and {item['Name']}, registered for one number.")
                else:
                    # replace entry with duplicate entry which holds more information
                    unique_data[-1]['Name'] = item['Name']
                    unique_data[-1]['in_contacts'] = item['in_contacts']
            
    # write the sorted data back to the CSV file
    with open(output_file, 'w', newline='') as f:
        fieldnames = ["Number", "Name", "in_contacts", "country_code", "occ"]
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(unique_data)
        print(f"Wrote new file: {output_file} with {len(unique_data)} lines.")

    # delete input file
    if delete_input_file_afterwards: 
        os.remove(input_file)
        print(f"Deleted file: {input_file}.")
    # return filepath such that functions can be nested
    return output_file


def sort_csv_file(input_file, col_to_sort='occ', output_file=None, delete_input_file_afterwards = True):
    '''
    Sorts a given csv file according to the col to sort.
    If output_file argument is not provided or None, file is saved to <input_file>_sorted.txt
    '''
    if not output_file:
        output_file = os.path.splitext(input_file)[0] + f"_sorted_by_{col_to_sort}.txt"
    # set up lambda function according to 'col_to_sort' due to necessary cast with columns 'occ', 'numbers'
    if col_to_sort == 'occ':
        cast = lambda c:  -int(c)
    elif col_to_sort == 'number':
        cast = lambda c: int(c[:4])
    elif col_to_sort == 'in_contacts':
        cast = lambda c: bool(c)
    else: # keep string format with 'Name', 'country_code'
        cast = lambda c: c
        
    # Read the CSV file
    with open(input_file, newline='') as f:
        reader = csv.DictReader(f)
        data = [row for row in reader]

    # Sort the data by the "country_code" column
    sorted_data = sorted(data[1:], key=lambda x: cast(x[col_to_sort]))

    # write the sorted data back to the CSV file
    with open(output_file, 'w', newline='') as f:
        fieldnames = ["Number", "Name", "in_contacts", "country_code", "occ"]
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(sorted_data)
        print(f"Wrote new file: {output_file} with {len(sorted_data)} lines.")
    
    if delete_input_file_afterwards: 
        os.remove(input_file)
        print(f"Deleted file: {input_file}.")
    # return filepath such that functions can be nested
    return output_file


evaluated_csv_file = sort_csv_file(add_number_occurences(add_country_codes_to_numbers(extract_linked_number_and_names(preextracted_pattern_file, extract_non_linked_numbers=False, delete_input_file_afterwards=False))))
occ_in_number_file = sort_csv_file(add_number_occurences(add_country_codes_to_numbers(input_file="output/number_simple.txt", delete_input_file_afterwards=False)))


# TODO out into output directory
# TODO append header if not appended!


Wrote new file number_and_contacts.csv with 58 lines.
Wrote new file number_and_contacts_with_country_code.txt with 58 lines!
Deleted file: number_and_contacts.csv.
Wrote new file: number_and_contacts_with_country_code_and_occurences.txt with 57 lines.
Deleted file: number_and_contacts_with_country_code.txt.
Wrote new file: number_and_contacts_with_country_code_and_occurences_sorted_by_occ.txt with 56 lines.
Deleted file: number_and_contacts_with_country_code_and_occurences.txt.
Wrote new file output/number_simple_with_country_code.txt with 19000 lines!
Wrote new file: output/number_simple_with_country_code_and_occurences.txt with 7794 lines.
Deleted file: output/number_simple_with_country_code.txt.
Wrote new file: output/number_simple_with_country_code_and_occurences_sorted_by_occ.txt with 7793 lines.
Deleted file: output/number_simple_with_country_code_and_occurences.txt.
