# ⚙️🛠️ Parsing Data from player.txt files

This is an example of how we extracted the data from the Fallout telemetry system. We have one log file per player, in which each line records a single action, such as a dialogue action, a quest update, or an attack.

We need a script that parses the raw data and converts it into data frames. To show how the unstructured output of the **telemetry system** can be transformed, this notebook is divided into sections, one per type of action.

Before starting, let's check how the files are stored.

![files](Images/files.png "Files")

Additional Notes:

```
#143  -> #191
On Oct 27, I assigned #143 subject to #191 

106 uses God mode 

127 and 113 seem to have an extremely large range in the introhouse location
```

As we can see, this is not an optimal way to store the data: at scale, the log data of a video game should be kept in a database, not in folders or on a SharePoint site as in this example.

This is how each .txt file looks inside:

![text file](Images/Text.png "Fines")

Each player's log file contains many unstructured records. Every line represents one action, followed by the attributes related to it. The action types are:

```
['Position_Introhouse' 'Quest' 'Dialogue' 'ObjectOnActivate' 'InteractionObject' 'InteractionDoor' 'InteractionNPC'
 'Creature Giant Rat attacked first' 'InteractionContainer' 'Position_Outside' 'Position_Bar' 'InteractionInterior'
 'PlayerLootedItem' 'CraftingTable' 'Position_SheriffOffice' 'PlayerJumped' 'Position_AbandonedHouse' 'Attacked'
 'Player health: <no name>' 'Player killed' 'Player looted dead' 'PlayerEquipped' 'PlayerUnequipped' 'Stat']
```
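
A list like this can be collected by reading the first comma-separated field of every line. The sketch below is one way to do it, not necessarily how the list above was produced; `list_action_types` is a hypothetical helper, the sample lines are made up, and it assumes the action name is always the first field.

```python
def list_action_types(lines):
    """Return the distinct action names, in order of first appearance."""
    seen = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        action = line.split(",", 1)[0]
        seen.setdefault(action, None)  # dicts preserve insertion order
    return list(seen)


# Made-up sample lines, only to illustrate the call
sample = [
    "Quest,101,250,Sheriff, Started\n",
    "\n",
    "Dialogue,101,300,0,Sheriff,Hello there\n",
    "Quest,101,900,Sheriff, Completed\n",
]
print(list_action_types(sample))
```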

Because there are many types of actions, we handle each one in its own section below.


## 🧙‍♂️ Quest Data

This data is identified by "Quest" at the beginning of the line. For analysis purposes, we are only going to use:

* Player: Id of the player

* Name: Name of the NPC with whom you need to interact in the quest (Name of the Quest)

* Status: Whether the quest is completed or not

* Timestamp: Time when action was recorded, which is a counter of every 20 milliseconds
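
If the counter really advances once every 20 milliseconds, a timestamp can be turned into elapsed seconds with a single multiplication. This is a minimal sketch that assumes the counter starts at zero when the session begins, which the logs do not confirm; `ticks_to_seconds` and `TICK_MS` are names introduced here for illustration.

```python
TICK_MS = 20  # assumption: one counter step equals 20 milliseconds


def ticks_to_seconds(ticks, tick_ms=TICK_MS):
    """Convert a tick counter (int or numeric string) to elapsed seconds."""
    return int(ticks) * tick_ms / 1000


print(ticks_to_seconds("1500"))  # 1500 ticks * 20 ms = 30.0 seconds
```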

In [None]:
# Dependencies required
import pandas as pd

In [None]:
# Function to extract player quest info
def questext(path):
    # Rows are collected here and the dataframe is built once at the end,
    # because DataFrame.append was removed in pandas 2.0
    rows = []

    with open(path, encoding = 'cp1252') as f:

        # Iterate over each line of the player record
        for line in f:

            # Remove the trailing \n
            line = line.replace('\n', "")

            # Skip empty or one-character lines, e.g. ""
            if len(line) > 1:

                # Split the line on ',' into a list of fields
                line = line.split(",")

                # Detect the type of activity log
                activity = line[0]

                # In case the activity is "Quest"
                if activity == "Quest":
                    player = line[1]
                    timestamp = line[2]
                    questname = line[3]
                    queststatus = line[4]

                    # Remove blank spaces from the status of the quest
                    queststatus = queststatus.replace(" ", "")

                    # Keep only records where the quest is started or completed;
                    # discard the intermediate steps of uncompleted quests
                    if queststatus in ("Started", "Completed"):
                        rows.append([player, questname, queststatus, timestamp])

    # Build the Quests DF from the collected rows
    questsdf = pd.DataFrame(rows, columns = ['player','quest_name', 'status', 'timestamp'])

    return questsdf

## 🦜 Dialogue Data

This data is identified by "Dialogue" at the beginning of the line. For analysis purposes, we are only going to use:

* Player: Id of the player

* Character: Name of the NPC with whom the player spoke

* Utterance: String containing the dialogue text

* Timestamp: Time when action was recorded, which is a counter of every 20 milliseconds

In [None]:
# Function to extract player dialogue info
def dialoguext(path):
    # Rows are collected here and the dataframe is built once at the end,
    # because DataFrame.append was removed in pandas 2.0
    rows = []

    with open(path, encoding = 'cp1252') as f:

        # Iterate over each line of the player record
        for line in f:

            # Remove the trailing \n
            line = line.replace('\n', "")

            # Skip empty or one-character lines, e.g. ""
            if len(line) > 1:

                # Split the line on ',' into a list of fields
                line = line.split(",")

                # Detect the type of activity log
                activity = line[0]

                # In case the activity is "Dialogue"
                if activity == "Dialogue":
                    player = line[1]
                    timestamp = line[2]
                    character = line[4]
                    utterance = line[5]

                    # New row for the dataframe
                    rows.append([player, character, utterance, timestamp])

    # Build the Dialogue DF from the collected rows
    dialoguedf = pd.DataFrame(rows, columns = ['player', 'character', 'utterance', 'timestamp'])

    return dialoguedf
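
One caveat of splitting on every comma is that an utterance containing commas is cut at its first comma. If the utterance is the last field of a Dialogue record (an assumption, since the log format is not documented here), limiting the number of splits keeps it whole. `split_dialogue_line` is a hypothetical helper and the sample record is made up.

```python
def split_dialogue_line(line):
    """Split a Dialogue record into at most six fields.

    Assumes the utterance is the sixth and final field, so any commas
    after the fifth separator belong to the utterance text.
    """
    return line.rstrip("\n").split(",", maxsplit=5)


# Made-up record for illustration
fields = split_dialogue_line("Dialogue,101,300,0,Sheriff,Well, hello there\n")
print(fields[5])  # Well, hello there
```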

## 🪓 Attacked info

This data is identified by "Attacked" at the beginning of the line. For analysis purposes, we are only going to use:

* Player: Id of the player

* Timestamp: Time when action was recorded, which is a counter of every 20 milliseconds

* Object_attacked: Object attacked by the player

* Reason: Quest related / unmotivated / self defense

In [None]:
# Function to extract player attack info
def attackedext(path):
    # Rows are collected here and the dataframe is built once at the end,
    # because DataFrame.append was removed in pandas 2.0
    rows = []

    with open(path, encoding = 'cp1252') as f:

        # Iterate over each line of the player record
        for line in f:

            # Remove the trailing \n
            line = line.replace('\n', "")

            # Skip empty or one-character lines, e.g. ""
            if len(line) > 1:

                # Split the line on ',' into a list of fields
                line = line.split(",")

                # Detect the type of activity log
                activity = line[0]

                # In case the activity is "Attacked"
                if activity == "Attacked":
                    player = line[1]
                    timestamp = line[2]
                    object_attacked = line[3]
                    reason = line[4]

                    # New row for the dataframe
                    rows.append([player, timestamp, object_attacked, reason])

    # Build the Attacked DF from the collected rows
    attackdf = pd.DataFrame(rows, columns = ['player', 'timestamp', 'object_attacked', 'reason'])

    return attackdf

## 🥾 Movement Data

This data is identified by "Position_Introhouse", "Position_Outside", "Position_Bar", "Position_SheriffOffice" or "Position_AbandonedHouse" at the beginning of the line. For analysis purposes, we are only going to use:

* Location Name: Area of the open-world landscape where data was recorded

* Player: Id of the player

* Timestamp: Time when action was recorded, which is a counter of every 20 milliseconds

* Position X: player location relative to x axis

* Position Y: player location relative to y axis

* Position Z: player location relative to z axis

* Orientation X: Orientation of the camera vector during recording relative to x axis

* Orientation Y: Orientation of the camera vector during recording relative to y axis

* Orientation Z: Orientation of the camera vector during recording relative to z axis

* Health: Status of player health

In [None]:
# Function to extract player position info
def positionext(path):
    # Rows are collected here and the dataframe is built once at the end,
    # because DataFrame.append was removed in pandas 2.0
    rows = []

    with open(path, encoding = 'cp1252') as f:

        # Iterate over each line of the player record
        for line in f:

            # Remove the trailing \n
            line = line.replace('\n', "")

            # Skip empty or one-character lines, e.g. ""
            if len(line) > 1:

                # Split the line on ',' into a list of fields
                line = line.split(",")

                # Detect the type of activity log
                activity = line[0]

                # In case the activity is "Position_<location>"; requiring the
                # underscore guarantees that the split below has a second part
                if activity.startswith('Position_'):
                    location = activity.split('_',1)[1] # location without the "Position_" header
                    player = line[1]
                    timestamp = line[2]
                    pos_x = line[3]
                    pos_y = line[4]
                    pos_z = line[5]
                    orient_x = line[6]
                    orient_y = line[7]
                    orient_z = line[8]
                    health = line[9]

                    # New row for the dataframe
                    rows.append([location,player,timestamp,pos_x,pos_y,pos_z,orient_x,orient_y,orient_z,health])

    # Build the Position DF from the collected rows
    positiondf = pd.DataFrame(rows, columns = ['location','player','timestamp','pos_x','pos_y','pos_z','orient_x','orient_y','orient_z','health'])

    return positiondf
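
Every value produced by `str.split` is a string, so the coordinates need casting before any numeric analysis. The following is a minimal sketch using `pandas.to_numeric`. The sample rows are invented, `cast_position_columns` is a hypothetical helper, and `errors='coerce'` (which turns unparsable values into NaN) is one possible policy, not necessarily the right one for this dataset.

```python
import pandas as pd

NUMERIC_COLS = ['timestamp', 'pos_x', 'pos_y', 'pos_z',
                'orient_x', 'orient_y', 'orient_z', 'health']


def cast_position_columns(df):
    """Return a copy of df with the numeric columns converted from strings."""
    out = df.copy()
    for col in NUMERIC_COLS:
        out[col] = pd.to_numeric(out[col], errors='coerce')
    return out


# Invented rows shaped like the output of positionext
raw = pd.DataFrame(
    [['Bar', '101', '250', '1.5', '-2.0', '0.0', '0', '0', '90', '100'],
     ['Bar', '101', '251', 'oops', '-2.1', '0.0', '0', '0', '91', '100']],
    columns=['location', 'player'] + NUMERIC_COLS,
)
clean = cast_position_columns(raw)
print(clean.dtypes)
```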

## 🟢 Object Interaction info

This data is identified by "InteractionObject" at the beginning of the line. For analysis purposes, we are only going to use:

* Location Name: Area of the open-world landscape where data was recorded

* Player: Id of the player

* Object: Item used by the player

* Position X: player location relative to x axis

* Position Y: player location relative to y axis

* Position Z: player location relative to z axis

In [None]:
# Function to extract player object interaction info
def intobjext(path):
    # Rows are collected here and the dataframe is built once at the end,
    # because DataFrame.append was removed in pandas 2.0
    rows = []

    with open(path, encoding = 'cp1252') as f:

        # Iterate over each line of the player record
        for line in f:

            # Remove the trailing \n
            line = line.replace('\n', "")

            # Skip empty or one-character lines, e.g. ""
            if len(line) > 1:

                # Split the line on ',' into a list of fields
                line = line.split(",")

                # Detect the type of activity log
                activity = line[0]

                # In case the activity is "InteractionObject"
                if activity == "InteractionObject":
                    location = line[1]
                    player = line[2]
                    obj = line[3]
                    pos_x = line[4]
                    pos_y = line[5]
                    pos_z = line[6]

                    # New row for the dataframe
                    rows.append([location,player,obj,pos_x,pos_y,pos_z])

    # Build the object interaction DF from the collected rows
    intobjdf = pd.DataFrame(rows, columns = ['location','player','object','pos_x','pos_y','pos_z'])

    return intobjdf

## 🚪 Door Interaction info

This data is identified by "InteractionDoor" at the beginning of the line. For analysis purposes, we are only going to use:

* Location Name: Area of the open-world landscape where data was recorded

* Player: Id of the player

* Position X: player location relative to x axis

* Position Y: player location relative to y axis

* Position Z: player location relative to z axis

In [None]:
# Function to extract player door interaction info
def intdoorext(path):
    # Rows are collected here and the dataframe is built once at the end,
    # because DataFrame.append was removed in pandas 2.0
    rows = []

    with open(path, encoding = 'cp1252') as f:

        # Iterate over each line of the player record
        for line in f:

            # Remove the trailing \n
            line = line.replace('\n', "")

            # Skip empty or one-character lines, e.g. ""
            if len(line) > 1:

                # Split the line on ',' into a list of fields
                line = line.split(",")

                # Detect the type of activity log
                activity = line[0]

                # In case the activity is "InteractionDoor"
                if activity == "InteractionDoor":
                    location = line[1]
                    player = line[2]
                    pos_x = line[3]
                    pos_y = line[4]
                    pos_z = line[5]

                    # New row for the dataframe
                    rows.append([location,player,pos_x,pos_y,pos_z])

    # Build the door interaction DF from the collected rows
    intdoordf = pd.DataFrame(rows, columns = ['location','player','pos_x','pos_y','pos_z'])

    return intdoordf

## 🧟 NPC Interaction info

This data is identified by "InteractionNPC" at the beginning of the line. For analysis purposes, we are only going to use:

* Location Name: Area of the open-world landscape where data was recorded

* Player: Id of the player

* NPC name: Name of the NPC with whom the player interacted

* Position X: player location relative to x axis

* Position Y: player location relative to y axis

* Position Z: player location relative to z axis

In [None]:
# Function to extract player NPC interaction info
def intnpcext(path):
    # Rows are collected here and the dataframe is built once at the end,
    # because DataFrame.append was removed in pandas 2.0
    rows = []

    with open(path, encoding = 'cp1252') as f:

        # Iterate over each line of the player record
        for line in f:

            # Remove the trailing \n
            line = line.replace('\n', "")

            # Skip empty or one-character lines, e.g. ""
            if len(line) > 1:

                # Split the line on ',' into a list of fields
                line = line.split(",")

                # Detect the type of activity log
                activity = line[0]

                # In case the activity is "InteractionNPC"
                if activity == "InteractionNPC":
                    location = line[1]
                    player = line[2]
                    npc_name = line[3]
                    pos_x = line[4]
                    pos_y = line[5]
                    pos_z = line[6]

                    # New row for the dataframe
                    rows.append([location,player,npc_name,pos_x,pos_y,pos_z])

    # Build the NPC interaction DF from the collected rows
    intnpcdf = pd.DataFrame(rows, columns = ['location','player','npc_name','pos_x','pos_y','pos_z'])

    return intnpcdf

## 💀 Killed info

This data is identified by "Player killed" at the beginning of the line. For analysis purposes, we are only going to use:

* Player: Id of the player

* Timestamp: Time when action was recorded, which is a counter of every 20 milliseconds

* Killed By: NPC or user who killed the player

In [None]:
# Function to extract player killed info
def killedext(path):
    # Rows are collected here and the dataframe is built once at the end,
    # because DataFrame.append was removed in pandas 2.0
    rows = []

    with open(path, encoding = 'cp1252') as f:

        # Iterate over each line of the player record
        for line in f:

            # Remove the trailing \n
            line = line.replace('\n', "")

            # Skip empty or one-character lines, e.g. ""
            if len(line) > 1:

                # Split the line on ',' into a list of fields
                line = line.split(",")

                # Detect the type of activity log
                activity = line[0]

                # In case the activity is "Player killed"
                if activity == "Player killed":
                    player = line[1]
                    timestamp = line[2]
                    killed_by = line[3]

                    # New row for the dataframe
                    rows.append([player,timestamp,killed_by])

    # Build the killed DF from the collected rows
    killeddf = pd.DataFrame(rows, columns = ['player','timestamp','killed_by'])

    return killeddf

## 🔫 Shooting to Death info

This data is identified by "Player shooting a dead" at the beginning of the line. For analysis purposes, we are only going to use:

* Player: Id of the player

* Timestamp: Time when action was recorded, which is a counter of every 20 milliseconds

* Shooting to: NPC or user who the player killed

In [None]:
# Function to extract player shots info
def shootext(path):
    # Rows are collected here and the dataframe is built once at the end,
    # because DataFrame.append was removed in pandas 2.0
    rows = []

    with open(path, encoding = 'cp1252') as f:

        # Iterate over each line of the player record
        for line in f:

            # Remove the trailing \n
            line = line.replace('\n', "")

            # Skip empty or one-character lines, e.g. ""
            if len(line) > 1:

                # Split the line on ',' into a list of fields
                line = line.split(",")

                # Detect the type of activity log
                activity = line[0]

                # In case the activity is "Player shooting a dead"
                if activity == "Player shooting a dead":
                    player = line[1]
                    timestamp = line[2]
                    shooting_to = line[3]

                    # New row for the dataframe
                    rows.append([player,timestamp,shooting_to])

    # Build the shots DF from the collected rows
    shootdf = pd.DataFrame(rows, columns = ['player','timestamp','shooting_to'])

    return shootdf

## 🪙 Looting items info

This data is identified by "PlayerLootedItem" at the beginning of the line. For analysis purposes, we are only going to use:

* Player: Id of the player

* Timestamp: Time when action was recorded, which is a counter of every 20 milliseconds

* Item: object looted

In [None]:
# Function to extract items looted info
def lootitemext(path):
    # Rows are collected here and the dataframe is built once at the end,
    # because DataFrame.append was removed in pandas 2.0
    rows = []

    with open(path, encoding = 'cp1252') as f:

        # Iterate over each line of the player record
        for line in f:

            # Remove the trailing \n
            line = line.replace('\n', "")

            # Skip empty or one-character lines, e.g. ""
            if len(line) > 1:

                # Split the line on ',' into a list of fields
                line = line.split(",")

                # Detect the type of activity log
                activity = line[0]

                # In case the activity is "PlayerLootedItem"
                if activity == "PlayerLootedItem":
                    # No player field is read from this record type;
                    # the file path is used as the player identifier
                    player = path
                    timestamp = line[2]
                    item = line[1]

                    # New row for the dataframe
                    rows.append([player,timestamp,item])

    # Build the items looted DF from the collected rows
    lootitemdf = pd.DataFrame(rows, columns = ['player','timestamp','item'])

    return lootitemdf
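
The extraction functions above differ mainly in the action name they match and the field positions they read. As a sketch of how that duplication could be factored out, the helper below takes the action name and a mapping from column name to field index. `extract_records` and `QUEST_FIELDS` are hypothetical names, the sample lines are made up, and the indices mirror those used in `questext`.

```python
def extract_records(lines, activity, fields):
    """Collect one dict per line whose first field equals `activity`.

    `fields` maps an output column name to the index of the
    comma-separated field it should be read from.
    """
    records = []
    needed = max(fields.values())
    for raw in lines:
        parts = raw.rstrip("\n").split(",")
        if parts[0] != activity or len(parts) <= needed:
            continue  # wrong action type, or too short to index safely
        records.append({name: parts[idx] for name, idx in fields.items()})
    return records


# Field positions copied from questext
QUEST_FIELDS = {'player': 1, 'quest_name': 3, 'status': 4, 'timestamp': 2}

# Made-up lines for illustration
sample = [
    "Quest,101,250,Sheriff,Started\n",
    "Dialogue,101,300,0,Sheriff,Hello\n",
    "Quest,101\n",
]
print(extract_records(sample, "Quest", QUEST_FIELDS))
```

The resulting list of dicts can be passed directly to `pd.DataFrame(records)`. Special cases, such as the status filter in `questext`, would still need their own step.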

# 👨‍🔧 Data Engineering

First, let's create the empty dataframes that will hold the data.

In [None]:
# Empty Quests DF
questsdf = pd.DataFrame(columns = ['player','quest_name', 'status', 'timestamp'])

# Empty Dialogue DF
dialoguedf = pd.DataFrame(columns = ['player', 'character', 'utterance', 'timestamp'])

# Empty Attacked DF
attackdf = pd.DataFrame(columns = ['player', 'timestamp', 'object_attacked', 'reason'])

# Empty Position DF
positiondf = pd.DataFrame(columns = ['location','player','timestamp','pos_x','pos_y','pos_z','orient_x','orient_y','orient_z','health'])

# Empty object interaction DF
intobjdf = pd.DataFrame(columns = ['location','player','object','pos_x','pos_y','pos_z'])

# Empty door interaction DF
intdoordf = pd.DataFrame(columns = ['location','player','pos_x','pos_y','pos_z'])

# Empty NPC interaction DF
intnpcdf = pd.DataFrame(columns = ['location','player','npc_name','pos_x','pos_y','pos_z'])

# Empty killed DF
killeddf = pd.DataFrame(columns = ['player','timestamp','killed_by'])

# Empty shots DF
shootdf = pd.DataFrame(columns = ['player','timestamp','shooting_to'])

# Empty items looted DF
lootitemdf = pd.DataFrame(columns = ['player','timestamp','item'])

We also need a function that appends each file's records to these dataframes, stacking one on top of another.

In [None]:
def appender(root):
    # Stack the records of one player file under each global dataframe.
    # pd.concat is used because DataFrame.append was removed in pandas 2.0.

    # Add to Quests DF
    new_questsdf = pd.concat([questsdf, questext(root)], ignore_index=True)

    # Add to Dialogue DF
    new_dialoguedf = pd.concat([dialoguedf, dialoguext(root)], ignore_index=True)

    # Add to Attacked DF
    new_attackdf = pd.concat([attackdf, attackedext(root)], ignore_index=True)

    # Add to Position DF
    new_positiondf = pd.concat([positiondf, positionext(root)], ignore_index=True)

    # Add to object interaction DF
    new_intobjdf = pd.concat([intobjdf, intobjext(root)], ignore_index=True)

    # Add to door interaction DF
    new_intdoordf = pd.concat([intdoordf, intdoorext(root)], ignore_index=True)

    # Add to NPC interaction DF
    new_intnpcdf = pd.concat([intnpcdf, intnpcext(root)], ignore_index=True)

    # Add to killed DF
    new_killeddf = pd.concat([killeddf, killedext(root)], ignore_index=True)

    # Add to shots DF
    new_shootdf = pd.concat([shootdf, shootext(root)], ignore_index=True)

    # Add to items looted DF
    new_lootitemdf = pd.concat([lootitemdf, lootitemext(root)], ignore_index=True)

    return new_questsdf, new_dialoguedf, new_attackdf, new_positiondf, new_intobjdf, new_intdoordf, new_intnpcdf, new_killeddf, new_shootdf, new_lootitemdf


We'll fetch all the data into these dataframes by applying our function across the data directory and its subfolders.

In [None]:
import os
import sys

# Folder with data inside the directory
data_folder = 'data'

# DIRECTORY path of the data folder, next to WHERE the script is located
# (os.path.join picks the right separator for the operating system)
directory = os.path.join(sys.path[0], data_folder)

# Loop over main directory
for file in os.listdir(directory):
    # Complete FILE path
    file_path = os.path.join(directory, file)
    print(file_path)

    # Check if the listed entry is a directory
    if os.path.isdir(file_path):
        # Loop over the files inside the subdirectory
        for i in os.listdir(file_path):
            # In-directory FILE path
            in_file_path = os.path.join(file_path, i)
            # Append function
            questsdf, dialoguedf, attackdf, positiondf, intobjdf, intdoordf, intnpcdf, killeddf, shootdf, lootitemdf = appender(in_file_path)
    else:
        # Any other entry, .txt or not, is parsed as a player log
        questsdf, dialoguedf, attackdf, positiondf, intobjdf, intdoordf, intnpcdf, killeddf, shootdf, lootitemdf = appender(file_path)


```
benchmark: 12 minutes 2.8 seconds
```
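
Each call to `appender` opens the same file ten times and copies every accumulated dataframe, so the run time grows quickly with the number of files. One possible simplification, sketched below under the assumption that every file under the data folder is a player log, is to list the files first and concatenate once per action type. `iter_log_files` is a hypothetical helper; unlike the loop above, `os.walk` descends into all nested subfolders, not just one level.

```python
import os


def iter_log_files(root):
    """Yield the path of every file under `root`, including subfolders."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


# Hypothetical usage with the extractors defined earlier:
#   frames = [questext(p) for p in iter_log_files(directory)]
#   questsdf = pd.concat(frames, ignore_index=True)
# Note that pd.concat raises a ValueError when `frames` is empty.
```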

# Writing CSV files

Finally, we save each dataframe to its own CSV file.

In [None]:
attackdf.to_csv('parsed_data/attackdf.csv', encoding='utf-8', index=True)
dialoguedf.to_csv('parsed_data/dialoguedf.csv', encoding='utf-8', index=True)
intdoordf.to_csv('parsed_data/intdoordf.csv', encoding='utf-8', index=True)
intnpcdf.to_csv('parsed_data/intnpcdf.csv', encoding='utf-8', index=True)
intobjdf.to_csv('parsed_data/intobjdf.csv', encoding='utf-8', index=True)
killeddf.to_csv('parsed_data/killeddf.csv', encoding='utf-8', index=True)
lootitemdf.to_csv('parsed_data/lootitemdf.csv', encoding='utf-8', index=True)
positiondf.to_csv('parsed_data/positiondf.csv', encoding='utf-8', index=True)
questsdf.to_csv('parsed_data/questsdf.csv', encoding='utf-8', index=True)
shootdf.to_csv('parsed_data/shootdf.csv', encoding='utf-8', index=True)
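
Writing the ten files by hand makes copy-paste slips easy, and `to_csv` fails if the output folder does not exist. The sketch below is a loop-based alternative that creates the folder first. `write_frames` is a hypothetical helper, and the dictionary keys are assumed to be the desired file names.

```python
import os
import pandas as pd


def write_frames(frames, out_dir='parsed_data'):
    """Write each dataframe in `frames` to <out_dir>/<name>.csv."""
    os.makedirs(out_dir, exist_ok=True)  # to_csv does not create folders
    written = []
    for name, df in frames.items():
        target = os.path.join(out_dir, name + '.csv')
        df.to_csv(target, encoding='utf-8', index=True)
        written.append(target)
    return written
```

It could be called as `write_frames({'attackdf': attackdf, 'shootdf': shootdf})`, with one entry per dataframe to export.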