# Analysing Cryptocurrency Tweets
    
Environment: Python 3.8.5 and Anaconda 4.10.3 (64-bit)
    
Libraries used:

 - re 2.2.1 (for regular expressions, included in Anaconda Python 3.8.5)
    
    
## 1. Introduction
This assignment consists of analyzing semi-structured text files in order to extract data and convert it to XML format.

1. Get all the data from the given '.txt' file.
2. Perform the necessary steps to extract the required content (using regular expressions) and further preprocessing to store all the required
data in a collective data type, say a dictionary.
3. Transform the obtained data into XML format.

The following sections explain more about this process.

## 2. Import libraries

Importing all the necessary libraries.

In [1]:
import re

## 3. Loading data

The first step in the process is to load the given text file. Its content is read with the readlines() file method and stored in a list called 'lines',
with one element per line of the file. The data stored in 'lines' is Twitter data.

For each user, the file includes their name, usercode, description, number of followers, verified status and tweets.

In [2]:
with open('task_input.txt', encoding = 'utf-8') as input_file: #the file is closed automatically after reading
    lines = input_file.readlines()
lines

['$uname.: Joko ⚡️\n',
 '$user_code.: 100022373\n',
 '$udesc.: Stacking Sats 🟠\n',
 'Routing Lightning Payments ⚡️\n',
 'Founder @BTC21_de 🖊️\n',
 'Working @ShiftCryptoHQ 🔑\n',
 '$followerNo.: 5.0 $verified?.: False $tweet_date.: 2021-04-19 18:57:58\n',
 '$tweet_text.: We are proud to be one of the first online toy retailers to accept Ethereum, Litecoin and Bitcoin. #bitcoin #Ethereum #litecoin #ltc #btc #eth #toystore #funko #funkopop #cryptocurrency #smallbusiness https://t.co/OdHGBFa6h0\n',
 '$uname.: Gaven Sirois\n',
 '$user_code.: 100018635\n',
 '$userdescription.: Ex-Merchant Navy officer 🚢, Trader, Father of 5 kiddos, Superdad Hall of Fame recipient, Jesus. Philadelphia Eagles. Specializing in automated trading #forex\n',
 '$No. followers.: 21.0 $verified_user?.: False $tweet_date.: 2021-04-19 18:57:57\n',
 '$tweet_text.: Weekly time frame frame for #Bitcoin. Did you see it? #BTC https://t.co/o7GSkawEQe\n',
 '$username.: Thomas Davies\n',
 '$user_code.: 100021889\n',
 '$userdesc

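Next, we join all the elements of the list 'lines' into a single string using the 'join' method of strings.
We also replace & (ampersand), ' (apostrophe) and " (double quotes) with their XML entities, as these are special characters in the XML
format. The ampersand is replaced first, so that the '&' introduced by the other two entities is not escaped again.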

In [3]:
ip_text = ''.join(lines)
ip_text = ip_text.replace('&', '&amp;')
ip_text = ip_text.replace("'", '&apos;')
ip_text = ip_text.replace('"', '&quot;')
ip_text
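
The three replacements above cover `&`, `'` and `"`, but `<` and `>` are also reserved in XML text. As a minimal sketch (not part of the original notebook), the standard library's `xml.sax.saxutils.escape` handles `&`, `<` and `>` in one call and accepts a dictionary of extra entities. The `sample` string is made up for illustration.

```python
from xml.sax.saxutils import escape

# Hypothetical sample line, not taken from task_input.txt
sample = 'BTC & ETH > "fiat" <3 it\'s true'

# escape() always handles &, < and >; extra entities are passed as a dict.
# The ampersand is escaped first, so the '&' in '&quot;' is not escaped twice.
escaped = escape(sample, {"'": "&apos;", '"': "&quot;"})
print(escaped)
```

If this were used in place of the manual replacements, it would have to be applied only once, since escaping an already escaped string turns `&amp;` into `&amp;amp;`.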



## 3. Extraction and storing of the text

Each user has a username and usercode that helps uniquely identify them. Hence, we split our string with the re.split() method using
the regular expression \$\w+name\.:.*, so that each resulting piece holds the details that follow one username line.

'`\$\w+name\.:.*`' is pattern_one. Here, we are looking for those tags in the input string that start with a '`$`', just like all the other tags in the data.
The tag for the username comes in different formats: some use '`$uname.:`', while others use '`$username.:`' or '`$user_name.:`'.


Hence, '`\w+`' looks for one or more word characters (letters, digits or underscore). In this case, it can match either 'u', 'user' or 'user_'.
'`name`' in the pattern looks for the 'name' in the tags, which is common amongst all of them.
Since the period ('`.`') is the wildcard character in regular expressions, we use '`\`' to escape it so that it is treated as a literal period and not
as a wildcard character. All tags also include a colon, '`:`', after the period.
'`.*`' looks for zero or more occurrences of any character except a newline. In this case, it matches the names of the users.

In [4]:
pattern_one = re.compile(r"\$\w+name\.:.*") #raw string, so the backslashes reach the regex engine unchanged
details = re.split(pattern_one, ip_text) #details is a list that will store the strings, after splitting
details.pop(0) #since the first string after splitting is an empty string, we are popping it.

''
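
To make the effect of the split concrete, here is a small sketch on a made-up two-user string laid out like the input file. The names, codes and tweets in `sample` are hypothetical.

```python
import re

# Hypothetical two-user sample in the same layout as task_input.txt
sample = (
    "$uname.: Alice\n"
    "$user_code.: 100000001\n"
    "$tweet_text.: first tweet\n"
    "$user_name.: Bob\n"
    "$user_code.: 100000002\n"
    "$tweet.: second tweet\n"
)

name_tag = re.compile(r"\$\w+name\.:.*")

# '.' does not match a newline, so '.*' stops at the end of the name line
chunks = name_tag.split(sample)

print(len(chunks))   # one leading empty string plus one chunk per user
print(chunks[1])
```

Because the text starts with a username tag, the first element of the result is an empty string, which is why the notebook pops it.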

In [5]:
#replacing all newline character(s)
for idx in range(len(details)):
    details[idx] = details[idx].replace('\n', ' ')

Along with the regular expression used in pattern_one above for names of the users, pattern_two also includes '`\s(\$user_code\.:\s\d{9})`'.
This aims to match the usercodes in the data, each of which is a unique nine-digit numeric code for a user.
In the pattern, '`\s`' matches a single whitespace character; the first one matches the newline between the username line and the usercode line. Again, since all tags begin with a '`$`', it is included in the pattern.
The tag for usercodes has the same form, '`$user_code.:`', in the given file, hence it is a part of the pattern. '`\d{9}`' matches
exactly nine digits.

We can use the re.findall() method to find all instances of the username and usercode (pairs) in the 'ip_text' string and store them in a list
called 'users', with each username and corresponding usercode as a tuple in the list.
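
As an illustrative sketch of this behaviour, the snippet below runs the same pattern on a made-up string (the names and codes are hypothetical). With two capturing groups, `re.findall` returns one tuple per match rather than the whole matched text.

```python
import re

# Hypothetical sample with two users
sample = (
    "$uname.: Alice\n$user_code.: 100000001\n"
    "$username.: Bob\n$user_code.: 100000002\n"
)

pair = re.compile(r"(\$\w+name\.:.*)\s(\$user_code\.:\s\d{9})")

# The first '\s' consumes the newline between the name line and the code line
pairs = pair.findall(sample)
print(pairs)
```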

In [6]:
pattern_two = re.compile(r"(\$\w+name\.:.*)\s(\$user_code\.:\s\d{9})") #raw string for the regex
users = re.findall(pattern_two, ip_text)
users

[('$uname.: Joko ⚡️', '$user_code.: 100022373'),
 ('$uname.: Gaven Sirois', '$user_code.: 100018635'),
 ('$username.: Thomas Davies', '$user_code.: 100021889'),
 ('$user_name.: Brandon 💎', '$user_code.: 100000736'),
 ('$user_name.: Adam S. Tracy', '$user_code.: 100003472'),
 ('$user_name.: Investisseur Crypto', '$user_code.: 100018470'),
 ('$user_name.: Crypto Geeks', '$user_code.: 100052727'),
 ('$uname.: Antonio Da Silva #FX', '$user_code.: 100078162'),
 ('$username.: Crypto Stuey', '$user_code.: 100042149'),
 ('$user_name.: Amareswar', '$user_code.: 100050098'),
 ('$username.: non-fungible proteins', '$user_code.: 100029317'),
 ('$username.: charlie', '$user_code.: 100061395'),
 ('$uname.: boostedBenz', '$user_code.: 100028586'),
 ('$user_name.: Aditya Chutia', '$user_code.: 100028283'),
 ('$username.: Amony', '$user_code.: 100010488'),
 ('$uname.: D.CRYPTO 🍥', '$user_code.: 100001197'),
 ('$uname.: Wednesdayss', '$user_code.: 100033571'),
 ('$username.: Joseph Miller', '$user_code.

In [7]:
#converting each tuple in the list users into a list
for idx in range(len(users)):
    users[idx] = list(users[idx])
users

[['$uname.: Joko ⚡️', '$user_code.: 100022373'],
 ['$uname.: Gaven Sirois', '$user_code.: 100018635'],
 ['$username.: Thomas Davies', '$user_code.: 100021889'],
 ['$user_name.: Brandon 💎', '$user_code.: 100000736'],
 ['$user_name.: Adam S. Tracy', '$user_code.: 100003472'],
 ['$user_name.: Investisseur Crypto', '$user_code.: 100018470'],
 ['$user_name.: Crypto Geeks', '$user_code.: 100052727'],
 ['$uname.: Antonio Da Silva #FX', '$user_code.: 100078162'],
 ['$username.: Crypto Stuey', '$user_code.: 100042149'],
 ['$user_name.: Amareswar', '$user_code.: 100050098'],
 ['$username.: non-fungible proteins', '$user_code.: 100029317'],
 ['$username.: charlie', '$user_code.: 100061395'],
 ['$uname.: boostedBenz', '$user_code.: 100028586'],
 ['$user_name.: Aditya Chutia', '$user_code.: 100028283'],
 ['$username.: Amony', '$user_code.: 100010488'],
 ['$uname.: D.CRYPTO 🍥', '$user_code.: 100001197'],
 ['$uname.: Wednesdayss', '$user_code.: 100033571'],
 ['$username.: Joseph Miller', '$user_code.

re.sub() is a regex method that works similarly to the replace method of strings, except that it replaces every match of a pattern. Here, each user's name tag is
replaced with '`user name:`' for consistency.

In [8]:
for idx in range(len(users)):
    users[idx][0] = re.sub(r"\$\w+name\.:", "user name:", users[idx][0])

In the list '`details`', we stored strings by splitting on the names of the users. Now, using '`users`', we are adding back the names of each user to their respective string that contains all their details. This is done with the help of the usercode.

In [9]:
#checking if usercode from 'users' exists in 'details', if yes, necessary updates are made
for idx in range(len(users)):
    if users[idx][1] in details[idx]:
        details[idx] = "---" + users[idx][0] + "---" + details[idx]

Now since we have all the data we need, we can next try to bring in some consistency by replacing all the tags using regular expressions and the re.sub() method.


'`\$\w+desc\w*\.:`' - a regular expression for matching all the tags of the user's description.
'`\$`' looks for the dollar symbol, which each tag in the data starts with. '`\w+`' (for one or more occurrences of an alphanumeric character) handles '`user`', '`u`' or '`user_`' in the tags,
'`desc`', for matching 'desc' and '`\w*`' matches zero or more occurrences of alphanumeric characters, that is, 'desc', 'description' in the tags. '`\.`' is for a period and '`:`' is simply a colon.

'`\$tweet\w*\.:`' - regular expressions for tags that contain tweets. Here, it matches tags including, '`$tweet.:`' and '`$tweet_text.:`'. '`\w*`' handles '_text', as it looks for 0 or more occurrences of alphanumeric characters. Lastly,
'`\.:`' to match a period and colon.

'`\$verified\w*\?\.:`' - this regex looks at those tags that contain the verified status of the users. Tags are either of the form '`$verified?.:`' or '`$verified_user?.:`'.
The '`\w*`' matches the '_user' part of the tag, if it exists. '`\?`' matches the literal '?' within the tags (escaped, because an unescaped '?' is a quantifier), followed by the period and colon ('`\.:`').

'`\$\w*\.?\s*follower\w*\.:`' - this matches those tags which hold information regarding the number of followers a user has.
These tags are: '`$No. followers.:`' and '`$followerNo.:`'.
'`\$`' matches '$', 
'`\w*`' for zero or more occurrences of alphanumeric characters ('No' for instance),
'`\.?`' for zero or one occurrence of a period,
'`\s*`' for 0 or more space characters,
'`follower`' for 'follower' in the tag,
'`\w*`', same as above, followed by '`\.:`' to match a period and colon.


Since the '`$user_code.:`' and '`$tweet_date.:`' are the only consistent tags, these have been replaced via the replace method of strings.

Now, we have,
 - '`---user_description:`' - new replacement for user description
 - '` -tweet_date:`' - new replacement for tweet dates
 - '`usercode:`' - new replacement for usercode
 - '`---tweet:`' - new replacement for tweets
 - '`-verified_user:`' - new replacement for verified status of a user
 - '`-no_followers:`' - new replacement for the number of followers of a user

The '-' and '---' in the new tags have been added to use them, if required, for any splitting or extraction later.

In [10]:
#replacement of all tags in the 'details' list
for idx in range(len(details)):
    details[idx] = re.sub(r"\$\w+desc\w*\.:", "---user_description:", details[idx])
    #'$tweet_date.:' is replaced before the generic tweet pattern, which would otherwise match it as well
    details[idx] = details[idx].replace('$tweet_date.:', ' -tweet_date:')
    details[idx] = details[idx].replace('$user_code.:', 'usercode:')
    details[idx] = re.sub(r"\$tweet\w*\.:", "---tweet:", details[idx])
    details[idx] = re.sub(r"\$verified\w*\?\.:", "-verified_user:", details[idx])
    details[idx] = re.sub(r"\$\w*\.?\s*follower\w*\.:", "-no_followers:", details[idx])
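
The tag replacements can be tried out on a single made-up detail string. The sketch below keeps the patterns from the text in an ordered list of rules; `sample` and the `rules` list are illustrative names, not part of the notebook. The order matters: '`\$tweet\w*\.:`' also matches '`$tweet_date.:`', so the date tag has to be rewritten first.

```python
import re

# Hypothetical detail string after newlines were replaced by spaces
sample = ("$userdescription.: Trader $No. followers.: 21.0 "
          "$verified_user?.: False $tweet_date.: 2021-04-19 18:57:57 "
          "$tweet_text.: Hello #BTC")

# (pattern, replacement) pairs, applied in order
rules = [
    (r"\$\w+desc\w*\.:", "---user_description:"),
    (r"\$tweet_date\.:", " -tweet_date:"),   # must run before the generic tweet rule
    (r"\$tweet\w*\.:", "---tweet:"),
    (r"\$verified\w*\?\.:", "-verified_user:"),
    (r"\$\w*\.?\s*follower\w*\.:", "-no_followers:"),
]

normalised = sample
for pattern, replacement in rules:
    normalised = re.sub(pattern, replacement, normalised)

print(normalised)
```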

Similarly, we're making changes to the usercode tags in the 'users' list as well.

In [11]:
#replacement of tag '$user_code.:' in 'users'
for idx in range(len(users)):
    users[idx][1] = users[idx][1].replace('$user_code.:', 'usercode:')

Next, using the 'users' and 'details' lists, we are storing the details of each user in a dictionary named 'user_dictionary'.
The usercodes from 'users' are used as keys for the dictionary, whereas values are coming from 'details'.

In [12]:
#for storing details of each user
user_dictionary = dict()
for key, value in zip(users, details):
    #setdefault creates an empty list the first time a usercode is seen; the value is then appended to it
    user_dictionary.setdefault(key[1], []).append(value)

Now, as the value for each key in 'user_dictionary' is a list that contains string(s) of all the details combined, we split these on '---' to obtain the details as individual strings, something like this:
    
'usercode: 100022373': ['',
  'user name: Joko ⚡️',
  ' usercode: 100022373 ',
  'user_description: Stacking Sats 🟠 Routing Lightning Payments ⚡️ Founder @BTC21_de 🖊️ Working @ShiftCryptoHQ 🔑 -no_followers: 5.0 -verified_user: False  -tweet_date: 2021-04-19 18:57:58 ',
  'tweet: We are proud to be one of the first online toy retailers to accept Ethereum, Litecoin and Bitcoin. #bitcoin #Ethereum #litecoin #ltc #btc #eth #toystore #funko #funkopop #cryptocurrency #smallbusiness https://t.co/OdHGBFa6h0 ']
    

Since the first element is an empty string, we pop it for each key's value.

In [13]:
for code in user_dictionary.keys():
    user_dictionary[code] = re.split(r"-{3}", ''.join(user_dictionary[code]))

In [14]:
for code in user_dictionary.keys():
    user_dictionary[code].pop(0)

There are some users who have multiple tweets, hence we convert each list to a set and back to a list to remove the resulting duplicates (for instance usernames, usercodes). Note that a set does not preserve the original order of the strings. Also, during the splitting process that happened above, some usercodes are of the form ' usercode: 100022373' and not
'usercode: 100022373'. Hence, we remove these extra ones by using re.sub() and the regex '`\s*usercode:\s\d{9}`'.

'`\s*usercode:\s\d{9}`' - '`\s*`' for zero or more whitespace characters, '`usercode:`' matches the tag name and '`\s\d{9}`' for a space followed by the nine-digit code. Once these have been replaced by empty strings, we remove the leftover strings next.

In [15]:
for code in user_dictionary.keys():
    user_dictionary[code] = list(set(user_dictionary[code]))

In [16]:
for code in user_dictionary.keys():
    for idx in range(len(user_dictionary[code])):
        user_dictionary[code][idx] = re.sub(r"\s*usercode:\s\d{9}", "", user_dictionary[code][idx])

In [17]:
for code in user_dictionary.keys():
    if " " in user_dictionary[code]:
        user_dictionary[code].remove(" ")
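
Converting to a `set` removes duplicates but does not keep the strings in their original order, so a user's tweets may end up in a different order than in the input file. If the order matters, one possible alternative (a sketch, not what the notebook does) is `dict.fromkeys`, which keeps the first occurrence of each string in place. The `parts` list is made up for illustration.

```python
# Hypothetical list of detail strings with repeated metadata
parts = [
    "user name: Alice",
    " usercode: 100000001 ",
    "tweet: first",
    "user name: Alice",
    " usercode: 100000001 ",
    "tweet: second",
]

# dict keys are unique and keep insertion order (guaranteed since Python 3.7)
unique_parts = list(dict.fromkeys(parts))
print(unique_parts)
```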

Since there are multiple tweets from some users, we also have multiple strings that store their metadata (user description, verified status, follower number). To tackle this, we use the latest tweet date to keep only the latest metadata for a user.
The function 'maxDate' defined below returns the latest tweet date.

In [18]:
def maxDate(dates):
    '''
    This function returns the latest date for an input list of dates.

    Every date has the fixed-width, zero-padded form 'YYYY-MM-DD HH:MM:SS',
    so comparing the strings character by character orders them by year,
    month, day, hour, minute and seconds, in that order.
    '''
    return max(dates)

The 'date_time_pattern' helps retrieve the dates of tweets for each user, which are stored in a list called 'dates'. If the length of 'dates' is greater than one, the list is passed to the function defined above to get the latest date.
Once the latest tweet date has been obtained, all the metadata strings that do not carry it are removed.

The pattern used: '`-tweet_date:\s*([0-9]{4}[-][0-9]{2}[-][0-9]{2}[\s]+[0-9]{2}[:][0-9]{2}[:][0-9]{2})`' looks for dates of the format
'2021-09-03 15:03:30'.
'`-tweet_date:`' - looks for the tag that holds the date and time of a tweet, '`\s*`' matches zero or more whitespace characters, '`[0-9]`' - a character class that looks for digits from 0 to 9, '`{4}`' - a quantifier of the form '`{m}`', which requires exactly m occurrences of what precedes it (in this case 4 and 2 digits); the related form '`{m,n}`' allows between m and n occurrences,
'`[-]`' and '`[:]`' look for '-' and ':' in between and '`[\s]+`' is a character class looking for one or more whitespace characters.
With the 'group(1)' call, only the date captured by the parentheses (and not the entire match) is extracted.

In [19]:
date_time_pattern = re.compile(r"-tweet_date:\s*([0-9]{4}[-][0-9]{2}[-][0-9]{2}[\s]+[0-9]{2}[:][0-9]{2}[:][0-9]{2})")

#extraction of tweet dates
for code in user_dictionary.keys():
    dates = []
    for line in user_dictionary[code]:
        if re.search(date_time_pattern, line):
            dates.append(re.search(date_time_pattern, line).group(1))
            
    if len(dates) > 1:
        max_date = maxDate(dates) #if 'dates' contains more than one date, 'maxDate' fn is called
        
        #if a line doesn't contain the latest tweet date obtained, it is replaced by an empty string and then finally removed
        for idx in range(len(user_dictionary[code])):
            if re.search(date_time_pattern, user_dictionary[code][idx]):
                if max_date not in user_dictionary[code][idx]: 
                    user_dictionary[code][idx] = ''
                    
    user_dictionary[code] = list(filter(None, user_dictionary[code]))
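
The following sketch illustrates the date extraction on two made-up metadata strings (the contents of `lines` are hypothetical). It shows the difference between the whole match and the captured group, and relies on the fact that zero-padded timestamps of this form sort chronologically when compared as strings.

```python
import re

date_time = re.compile(
    r"-tweet_date:\s*([0-9]{4}-[0-9]{2}-[0-9]{2}\s+[0-9]{2}:[0-9]{2}:[0-9]{2})"
)

# Hypothetical metadata strings for one user
lines = [
    "user_description: Trader -no_followers: 21.0 -verified_user: False  -tweet_date: 2021-04-19 18:57:57 ",
    "user_description: Trader -no_followers: 176.0 -verified_user: False  -tweet_date: 2021-04-20 09:15:00 ",
]

match = date_time.search(lines[0])
print(match.group(0))   # whole match, tag included
print(match.group(1))   # only the captured timestamp

dates = [m.group(1) for m in map(date_time.search, lines) if m]
latest = max(dates)     # zero-padded timestamps sort chronologically as strings
print(latest)
```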

'new_user_dict' is a new dictionary that'll hold all the details of each user from 'user_dictionary'. The difference is that each usercode is used as a key to identify a user and, within this key, further tags are used as keys for storing information about the user.
For instance, for a user, there'll be a 'user name' key that'll hold the name of the user, 'tweets' is a key that holds a list of tweet(s),
'verified_user' is a key that'll hold the verification status of a user (either 'True' or 'False'), 'user_description' will hold the description for the user and 'no_followers' will hold the number of followers a user has.

For example,

'usercode: 100022373' : {'tweets': [' We are proud to be one of the first online toy retailers to accept Ethereum, Litecoin and Bitcoin. #bitcoin #Ethereum #litecoin #ltc #btc #eth #toystore #funko #funkopop #cryptocurrency #smallbusiness https://t.co/OdHGBFa6h0 '],
 'verified_user': 'False',
 'user_description': ' Stacking Sats 🟠 Routing Lightning Payments ⚡️ Founder @BTC21_de 🖊️ Working @ShiftCryptoHQ 🔑 -no_followers: 5.0 -verified_user: False  -tweet_date: 2021-04-19 18:57:58 ',
 'no_followers': '5.0',
 'user name': ' Joko ⚡️'}

In [20]:
new_user_dict = dict()
for code in user_dictionary.keys():
    new_user_dict[code] = dict()
    new_user_dict[code]['tweets'] = []

To accomplish the above, 're.search()' is used to search for the given pattern/regex in each string within the list that holds details of each user.
Using 'group()' method, the required information is extracted and stored in the new dictionary.

 - '`user\sname:(.*)`' looks for the username tags and '`(.*)`' matches the information we require, here the names
 - '`tweet:(.*)`' looks for tags that contain tweets and appends the tweet text to the new dictionary via '`(.*)`'
 - '`verified_user:\s(False|True)\s*`' looks for tags with the verified status, and extracts the status ('True' or 'False') via the group '`(False|True)`'
 - '`user_description:(.*)`' searches for tags that hold the user description and then '`(.*)`' extracts the information
 - '`no_followers:\s(\d+\.\d*)`' searches for tags that hold the follower count and extracts the number via '`(\d+\.\d*)`'; numbers of the form 5.0 for instance can be retrieved, where '`\d+`' is for one or more digits followed by a period ('`\.`') and then '`\d*`' for zero or more digits after it.

In [21]:
#extraction of information from user_dictionary and storing in new_user_dict
for code in user_dictionary.keys():
    for line in user_dictionary[code]:
        
        if re.search(r"user\sname:(.*)", line):
            new_user_dict[code]['user name'] = re.search(r"user\sname:(.*)", line).group(1)
        
        if re.search(r"tweet:(.*)", line):
            new_user_dict[code]['tweets'].append(re.search(r"tweet:(.*)", line).group(1))
            
        if re.search(r"verified_user:\s(False|True)\s*", line):
            new_user_dict[code]['verified_user'] = re.search(r"verified_user:\s(False|True)\s*", line).group(1)

        if re.search(r"user_description:(.*)", line):
            new_user_dict[code]['user_description'] = re.search(r"user_description:(.*)", line).group(1)

        #the same pattern is used for the check and the extraction, so group(1) is never called on None
        if re.search(r"no_followers:\s(\d+\.\d*)", line):
            new_user_dict[code]['no_followers'] = re.search(r"no_followers:\s(\d+\.\d*)", line).group(1)
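
As a compact illustration of the extraction step, the sketch below applies the same patterns to made-up strings for a single user and searches each line only once per pattern. The `fields` dictionary, `record` and the contents of `lines` are hypothetical names and data introduced for this example.

```python
import re

# Hypothetical detail strings for one user
lines = [
    "user name: Alice",
    "user_description: Trader -no_followers: 21.0 -verified_user: False  -tweet_date: 2021-04-19 18:57:57 ",
    "tweet: Hello #BTC ",
]

# One compiled pattern per single-valued field; group 1 is the value
fields = {
    "user name": re.compile(r"user\sname:(.*)"),
    "verified_user": re.compile(r"verified_user:\s(False|True)"),
    "user_description": re.compile(r"user_description:(.*)"),
    "no_followers": re.compile(r"no_followers:\s(\d+\.\d*)"),
}
tweet_pattern = re.compile(r"tweet:(.*)")

record = {"tweets": []}
for line in lines:
    for key, pattern in fields.items():
        match = pattern.search(line)      # search once, reuse the match
        if match:
            record[key] = match.group(1)
    match = tweet_pattern.search(line)
    if match:
        record["tweets"].append(match.group(1))

print(record)
```

Note that the description still carries the follower count, verified status and tweet date at this point, just as in the notebook.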

'usercode: 100022373' : {'tweets': [' We are proud to be one of the first online toy retailers to accept Ethereum, Litecoin and Bitcoin. #bitcoin #Ethereum #litecoin #ltc #btc #eth #toystore #funko #funkopop #cryptocurrency #smallbusiness https://t.co/OdHGBFa6h0 '], 'verified_user': 'False', 'user_description': ' Stacking Sats 🟠 Routing Lightning Payments ⚡️ Founder @BTC21_de 🖊️ Working @ShiftCryptoHQ 🔑 -no_followers: 5.0 -verified_user: False -tweet_date: 2021-04-19 18:57:58 ', 'no_followers': '5.0', 'user name': ' Joko ⚡️'}

It can be seen that the 'user_description' key includes some unnecessary information as well (the follower count, verified status and tweet date). To remove it, we use regex patterns that match this unnecessary information and replace each match with an empty string. The same is applied to the tweets, to remove such instances, if any.
The patterns are similar to those used in the cases above.

In [22]:
#replacement of unnecessary information
p1 = re.compile(r"-no_followers:\s\d+\.\d*")
p2 = re.compile(r"-verified_user:\s(False|True)\s*")
p3 = re.compile(r"-tweet_date:\s[0-9]{4}[-][0-9]{2}[-][0-9]{2}[\s]+[0-9]{2}[:][0-9]{2}[:][0-9]{2}")

for code in new_user_dict.keys():
    new_user_dict[code]['user_description'] = re.sub(p1, '', new_user_dict[code]['user_description'])
    new_user_dict[code]['user_description'] = re.sub(p2, '', new_user_dict[code]['user_description'])
    new_user_dict[code]['user_description'] = re.sub(p3, '', new_user_dict[code]['user_description'])
    
for code in new_user_dict.keys():
    for idx in range(len(new_user_dict[code]['tweets'])):
        new_user_dict[code]['tweets'][idx] = re.sub(p1, '', new_user_dict[code]['tweets'][idx])
        new_user_dict[code]['tweets'][idx] = re.sub(p2, '', new_user_dict[code]['tweets'][idx])
        new_user_dict[code]['tweets'][idx] = re.sub(p3, '', new_user_dict[code]['tweets'][idx])

In [23]:
new_user_dict

{'usercode: 100022373': {'tweets': [' We are proud to be one of the first online toy retailers to accept Ethereum, Litecoin and Bitcoin. #bitcoin #Ethereum #litecoin #ltc #btc #eth #toystore #funko #funkopop #cryptocurrency #smallbusiness https://t.co/OdHGBFa6h0 '],
  'verified_user': 'False',
  'user_description': ' Stacking Sats 🟠 Routing Lightning Payments ⚡️ Founder @BTC21_de 🖊️ Working @ShiftCryptoHQ 🔑   ',
  'no_followers': '5.0',
  'user name': ' Joko ⚡️'},
 'usercode: 100018635': {'tweets': [' Weekly time frame frame for #Bitcoin. Did you see it? #BTC https://t.co/o7GSkawEQe ',
   ' It is only a matter of a few days for #Bitcoin  before the biggest buy signal for #btc appears https://t.co/VnfrV2ZSlT '],
  'user name': ' Gaven Sirois',
  'verified_user': 'False',
  'user_description': ' Ex-Merchant Navy officer 🚢, Trader, Father of 5 kiddos, Superdad Hall of Fame recipient, Jesus. Philadelphia Eagles. Specializing in automated trading #forex   ',
  'no_followers': '176.0'},
 'us

After the steps above have been performed, we get the following:
    
'usercode: 100022373' : {'tweets': [' We are proud to be one of the first online toy retailers to accept Ethereum, Litecoin and Bitcoin. #bitcoin #Ethereum #litecoin #ltc #btc #eth #toystore #funko #funkopop #cryptocurrency #smallbusiness https://t.co/OdHGBFa6h0 '],
 'verified_user': 'False',
 'user_description': ' Stacking Sats 🟠 Routing Lightning Payments ⚡️ Founder @BTC21_de 🖊️ Working @ShiftCryptoHQ 🔑   ',
 'no_followers': '5.0',
 'user name': ' Joko ⚡️'}

## 4. Conversion to XML

Now that we've stored our data in the format we wanted, the next and final step is to convert it and store it in an XML file.

A new file is opened and the data for each user from 'new_user_dict' is written to the XML file using the keys. The encoding used is 'utf-8' as the data also contains emojis/emoticons.
The file starts with the XML declaration '<?xml version = "1.0"?>', followed by the root element '<users>', within which we wrap all the users and their information in respective tags.

For example, for one user:
    
<?xml version="1.0"?>

<users>
    <user name="Joko ⚡️">
        <verified_user>False</verified_user>
        <user_description> Stacking Sats 🟠 Routing Lightning Payments ⚡️ Founder @BTC21_de 🖊️ Working @ShiftCryptoHQ 🔑 </user_description>
        <no_followers>5.0</no_followers>
        <tweets>
            <tweet> We are proud to be one of the first online toy retailers to accept Ethereum, Litecoin and Bitcoin. #bitcoin #Ethereum #litecoin #ltc #btc #eth #toystore #funko #funkopop #cryptocurrency #smallbusiness https://t.co/OdHGBFa6h0 </tweet>
        </tweets>
    </user> 
</users>

In [24]:
#opening a new file and writing to it; the 'with' block closes the file once writing is done
with open('output.xml', 'w', encoding = 'utf-8') as output_f:
    output_f.write('<?xml version = "1.0"?>' + '\n') #XML declaration
    output_f.write("<users>" + "\n") #root element, within which each user is wrapped

    for code, info in new_user_dict.items(): #'info' holds the details of one user

        output_f.write("<user name =" + '"' + info["user name"].strip() + '"' + ">" + "\n") #name of the user

        output_f.write("<verified_user>" + info["verified_user"] + "</verified_user>" + "\n") #verified status

        output_f.write("<user_description>" + info["user_description"] + "</user_description>" + "\n") #description of user

        output_f.write("<no_followers>" + info["no_followers"] + "</no_followers>" + "\n") #follower count

        output_f.write("<tweets>" + "\n") #all tweets of the user
        for tweet in info["tweets"]:
            output_f.write("<tweet>" + tweet + "</tweet>" + "\n") #each tweet is wrapped within this
        output_f.write("</tweets>" + "\n")

        output_f.write("</user>" + "\n") #closing tag of user

    output_f.write("</users>")   #closing tag of all users
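
For comparison, the same structure could be built with the standard library's `xml.etree.ElementTree`, which escapes special characters during serialisation. This is only a sketch of an alternative, assuming the values have not already been escaped by hand (otherwise an existing `&amp;` would be escaped a second time). The `record` dictionary is a made-up entry in the shape of one value of 'new_user_dict'.

```python
import xml.etree.ElementTree as ET

# Hypothetical record in the same shape as one entry of new_user_dict
record = {
    "user name": " Alice & Bob ",
    "verified_user": "False",
    "user_description": " Trader <3 crypto ",
    "no_followers": "21.0",
    "tweets": [" Hello #BTC "],
}

root = ET.Element("users")
user = ET.SubElement(root, "user", name=record["user name"].strip())
for tag in ("verified_user", "user_description", "no_followers"):
    ET.SubElement(user, tag).text = record[tag]
tweets = ET.SubElement(user, "tweets")
for text in record["tweets"]:
    ET.SubElement(tweets, "tweet").text = text

# Serialisation escapes &, < and > automatically
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# Parsing the result back is a quick well-formedness check
parsed = ET.fromstring(xml_text)
print(parsed.find("user").get("name"))
```

Parsing the generated text back, as in the last two lines, is also a simple way to confirm that a hand-written XML file is well-formed.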

## 5. Summary


This assessment measured the understanding of parsing text documents that contain semi-structured text, like tweets, and extracting the data from such files. The main outcomes achieved as a part of the process include:

 - **Parsing of semi-structured text files**: Using file methods, it was possible to access the required textual data.
 - **Extraction and storage of text data**: After loading the data, using basic Python data types and regex patterns, it was possible to extract the required information and store it.
 - **Writing to another format**: Using file methods and methods of built-in data types of Python, the extracted data has been written to and stored in a file of XML format.