## Preprocessing



1. Inspecting the documentation
2. Removing duplicates
3. Removing the HTML tags from the 'body' content
4. Making sure each 'body' content includes the country's name
5. Removing newlines



### 1) Inspecting the documentation

In [1]:
ls

documentation2.json  documentation4.json  preprocessed.json
documentation3.json  documentation.json   sample_data/


In [2]:
import json

In [3]:
with open('documentation.json') as f:
  data = json.load(f)

In [4]:
type(data)

list

In [5]:
# data

In [6]:
len(data)

414

In [7]:
data[0]

{'id': '6683745',
 'title': 'Serbia: SMS Guidelines',
 'body': '<h1 id="h_7b0806db4c"><b>Serbia: SMS Guidelines</b></h1>\n<p class="no-margin"><b>MCC: </b>220<br><b>Dial Code: </b>381<br><br>Alphanumeric Sender IDs are supported with registration. <br><br>Registration is possible to ensure Alphanumeric Senders can be maintained. Without registration, Alpha Senders will be overwritten to Generic Alpha Sender IDs. <br><br>The registration of Alphanumeric Senders involves monthly fees.<br><br>Please make sure to always refer to our <a href="https://app.intercom.com/" target="_blank" class="intercom-content-link">Acceptable Use Policy for Messaging</a>.<br><br>For more information on Alpha Sender ID registration kindly reach out to <a href="mailto:alpha_sender_id@telnyx.com" target="_blank" class="intercom-content-link">alpha_sender_id@telnyx.com</a>.<br><br></p>',
 'url': 'https://support.telnyx.com/en/articles/6683745-serbia-sms-guidelines'}

In [8]:
type(data[0])

dict
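
A quick structural check can complement the manual inspection above. The sketch below assumes every record is a dict with the keys seen in `data[0]` (`id`, `title`, `body`, `url`). The helper name `summarize` and the sample records are hypothetical and not part of the original notebook.

```python
from collections import Counter


def summarize(records):
    """Return basic structural facts about a list of article dicts."""
    # How many different key layouts occur (1 means a uniform schema)
    key_sets = Counter(frozenset(r.keys()) for r in records)
    # How often each article id occurs
    id_counts = Counter(r.get('id') for r in records)
    return {
        'records': len(records),
        'distinct_key_sets': len(key_sets),
        'distinct_ids': len(id_counts),
        'repeated_ids': sum(1 for n in id_counts.values() if n > 1),
    }


# Tiny illustrative sample (made up, not taken from documentation.json)
sample = [
    {'id': '1', 'title': 'A: SMS Guidelines', 'body': '<p>a</p>', 'url': 'u1'},
    {'id': '1', 'title': 'A: SMS Guidelines', 'body': '<p>a</p>', 'url': 'u1'},
    {'id': '2', 'title': 'B: SMS Guidelines', 'body': '<p>b</p>', 'url': 'u2'},
]
summary = summarize(sample)
print(summary)
```

On the real file, calling `summarize(data)` would show whether repeated ids are present before any deduplication is attempted.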

### 2) Removing duplicates

In [9]:
new_data = []

# loop through each dict
for state in data:
  # check if we have encountered this state yet
  if state not in new_data:
    # if not, append it to the list
    new_data.append(state)

In [10]:
len(new_data)

212
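
The `not in` test on a list compares each dict against every dict kept so far, which is fine for a few hundred records but scales quadratically. A minimal single-pass sketch follows, assuming the records are JSON-serializable dicts whose values are all strings, as in this file. `dedupe` is a hypothetical helper name, and the serialized-key approach is an alternative to the loop above, not the notebook's method.

```python
import json


def dedupe(records):
    """Drop exact duplicate dicts while preserving first-seen order."""
    seen = set()
    unique = []
    for record in records:
        # Dicts are unhashable, so use a canonical JSON string as the set key
        key = json.dumps(record, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Made-up sample: the second dict equals the first despite the key order
sample = [
    {'id': '1', 'title': 'A'},
    {'title': 'A', 'id': '1'},
    {'id': '2', 'title': 'B'},
]
deduped = dedupe(sample)
print(len(deduped))
```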

In [11]:
new_data[0]

{'id': '6683745',
 'title': 'Serbia: SMS Guidelines',
 'body': '<h1 id="h_7b0806db4c"><b>Serbia: SMS Guidelines</b></h1>\n<p class="no-margin"><b>MCC: </b>220<br><b>Dial Code: </b>381<br><br>Alphanumeric Sender IDs are supported with registration. <br><br>Registration is possible to ensure Alphanumeric Senders can be maintained. Without registration, Alpha Senders will be overwritten to Generic Alpha Sender IDs. <br><br>The registration of Alphanumeric Senders involves monthly fees.<br><br>Please make sure to always refer to our <a href="https://app.intercom.com/" target="_blank" class="intercom-content-link">Acceptable Use Policy for Messaging</a>.<br><br>For more information on Alpha Sender ID registration kindly reach out to <a href="mailto:alpha_sender_id@telnyx.com" target="_blank" class="intercom-content-link">alpha_sender_id@telnyx.com</a>.<br><br></p>',
 'url': 'https://support.telnyx.com/en/articles/6683745-serbia-sms-guidelines'}

In [12]:
# save the updated json file
with open('documentation2.json', 'w') as outfile:
    json.dump(new_data, outfile, indent=4)

In [13]:
# open updated json file
with open('documentation2.json') as f:
  data2 = json.load(f)

In [14]:
len(data2)

212

### 3) Removing the HTML tags from the 'body' content

In [15]:
from bs4 import BeautifulSoup

In [16]:
# example for a single dict's 'body' content
BeautifulSoup(data2[0]['body'].replace('<br><b>', '. ').replace('. <br><br>', '. ').replace('.<br><br>', '. ').replace('<br><br>', '. '), 'html.parser').text

'Serbia: SMS Guidelines\nMCC: 220. Dial Code: 381. Alphanumeric Sender IDs are supported with registration. Registration is possible to ensure Alphanumeric Senders can be maintained. Without registration, Alpha Senders will be overwritten to Generic Alpha Sender IDs. The registration of Alphanumeric Senders involves monthly fees. Please make sure to always refer to our Acceptable Use Policy for Messaging. For more information on Alpha Sender ID registration kindly reach out to alpha_sender_id@telnyx.com. '

In [17]:
# remove html from all 'body' content
for state in data2:
  state['body'] = BeautifulSoup(state['body'].replace('<br><b>', '. ').replace('. <br><br>', '. ').replace('.<br><br>', '. ').replace('<br><br>', '. '), 'html.parser').text
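
The chained `replace` calls appear twice above, so wrapping them in one function keeps the example cell and the loop in sync. The sketch below is an assumption-laden variant: it applies the same four replacements in the same order, but strips tags with the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. `strip_html` and `_TextExtractor` are hypothetical names, and the output should be similar to, but is not guaranteed identical to, what BeautifulSoup's `.text` returns.

```python
from html.parser import HTMLParser

# Same replacements, in the same order, as the cells above
BREAK_PATTERNS = ('<br><b>', '. <br><br>', '.<br><br>', '<br><br>')


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(body):
    """Turn line-break tags into sentence breaks, then drop all tags."""
    for pattern in BREAK_PATTERNS:
        body = body.replace(pattern, '. ')
    parser = _TextExtractor()
    parser.feed(body)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


# Shortened, made-up fragment in the style of the Serbia article
fragment = (
    '<h1 id="h"><b>Serbia: SMS Guidelines</b></h1>\n'
    '<p class="no-margin"><b>MCC: </b>220<br><b>Dial Code: </b>381<br><br>'
    'Supported. <br><br>Fees apply.<br><br></p>'
)
cleaned = strip_html(fragment)
print(repr(cleaned))
```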

In [18]:
# save the updated json file
with open('documentation3.json', 'w') as outfile:
    json.dump(data2, outfile, indent=4)

In [19]:
# double check the new file
with open('documentation3.json') as f:
  data3 = json.load(f)

In [20]:
print(data3)

[{'id': '6683745', 'title': 'Serbia: SMS Guidelines', 'body': 'Serbia: SMS Guidelines\nMCC: 220. Dial Code: 381. Alphanumeric Sender IDs are supported with registration. Registration is possible to ensure Alphanumeric Senders can be maintained. Without registration, Alpha Senders will be overwritten to Generic Alpha Sender IDs. The registration of Alphanumeric Senders involves monthly fees. Please make sure to always refer to our Acceptable Use Policy for Messaging. For more information on Alpha Sender ID registration kindly reach out to alpha_sender_id@telnyx.com. ', 'url': 'https://support.telnyx.com/en/articles/6683745-serbia-sms-guidelines'}, {'id': '6683734', 'title': 'Republic of Korea (South Korea): SMS Guidelines', 'body': 'Republic of Korea (South Korea): SMS Guidelines\nMCC: 450. Dial Code: 82. Alphanumeric Sender IDs are not supported. Registration is not possible.\n\nAll Alphanumeric Sender IDs will be overwritten to a random Local Long Code  to ensure delivery. All message

In [21]:
data3[0]

{'id': '6683745',
 'title': 'Serbia: SMS Guidelines',
 'body': 'Serbia: SMS Guidelines\nMCC: 220. Dial Code: 381. Alphanumeric Sender IDs are supported with registration. Registration is possible to ensure Alphanumeric Senders can be maintained. Without registration, Alpha Senders will be overwritten to Generic Alpha Sender IDs. The registration of Alphanumeric Senders involves monthly fees. Please make sure to always refer to our Acceptable Use Policy for Messaging. For more information on Alpha Sender ID registration kindly reach out to alpha_sender_id@telnyx.com. ',
 'url': 'https://support.telnyx.com/en/articles/6683745-serbia-sms-guidelines'}

### 4) Making sure each 'body' content includes the country's name

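Most 'body' sections start with the article title, which contains the country's name, but some don't. The cells below check whether the full title string occurs anywhere in the body.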

In [22]:
data3[0]['title']

'Serbia: SMS Guidelines'

In [23]:
data3[0]['title'] in data3[0]['body']

True

In [24]:
# let us find the country dicts
# in which the country is not mentioned in the body

# loop through all states
for state in data3:
  # if title not in body
  if state['title'] not in state['body']:
    # print title
    print(state['title'])

Yemen: SMS Guidelines
Uganda: SMS Guidelines
Turks and Caicos Islands: SMS Guidelines
Turkmenistan: SMS Guidelines
Syria: SMS Guidelines
Singapore: SMS Guidelines
Sierra Leone: SMS Guidelines
Saint Kitts & Nevis: SMS Guidelines
Palestinian Territory: SMS Guidelines
Palau: SMS Guidelines
Namibia: SMS Guidelines
Myanmar: SMS Guidelines
Mozambique: SMS Guidelines
Montserrat: SMS Guidelines
Mongolia: SMS Guidelines
Monaco: SMS Guidelines
Mauritius: SMS Guidelines
Mauritania: SMS Guidelines
Madagascar: SMS Guidelines
Laos PDR: SMS Guidelines
Kyrgyzstan: SMS Guidelines
Indonesia: SMS Guidelines
India: SMS Guidelines
Greenland: SMS Guidelines
Ghana: SMS Guidelines
Gambia: SMS Guidelines
Fiji: SMS Guidelines
Eritrea: SMS Guidelines
Cook Islands: SMS Guidelines
Congo: SMS Guidelines
China: SMS Guidelines
Chad: SMS Guidelines
Cape Verde: SMS Guidelines
Burundi: SMS Guidelines
Bangladesh: SMS Guidelines
American Samoa: SMS Guidelines
Suriname: SMS Guidelines
Iraq: SMS Guidelines
Ecuador: SMS Guid

In [25]:
# Example of a dict with missing country name in the body section
for state in data3:
  if state['id'] == '6545167':
    print(state)

{'id': '6545167', 'title': 'Poland: SMS Guidelines', 'body': 'MCC: 260. Dial Code: 48. Alphanumeric Sender IDs are supported and will be maintained, no registration is required.\nThe use of generic Alpha Sender IDs is not recommended. Alpha Senders should be directly related to the message content. \nThere are no restrictions with regards to content towards this destination.\n\nPlease make sure to always refer to our Acceptable Use Policy for Messaging.', 'url': 'https://support.telnyx.com/en/articles/6545167-poland-sms-guidelines'}


In [26]:
# loop through all states
for state in data3:
  # if title not in body
  if state['title'] not in state['body']:
    # add title string to body
    state['body'] = state['title'] + '. ' + state['body']
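
The same fix can be written as a function that reports which records it changed, which makes the step easy to re-run and verify. This is a sketch using the notebook's own substring test. The names `add_missing_titles` and `country_from_title` are hypothetical, and the assumption that every title ends in `': SMS Guidelines'` is taken from the titles printed above rather than checked against the whole file.

```python
def country_from_title(title, suffix=': SMS Guidelines'):
    """Return the part of the title before the assumed common suffix."""
    return title[:-len(suffix)] if title.endswith(suffix) else title


def add_missing_titles(records):
    """Prepend the title to any body that lacks it; return affected countries."""
    changed = []
    for record in records:
        if record['title'] not in record['body']:
            record['body'] = record['title'] + '. ' + record['body']
            changed.append(country_from_title(record['title']))
    return changed


# Made-up sample: the first body lacks its title, the second already has it
sample = [
    {'title': 'Poland: SMS Guidelines', 'body': 'MCC: 260. Dial Code: 48.'},
    {'title': 'Serbia: SMS Guidelines',
     'body': 'Serbia: SMS Guidelines\nMCC: 220. Dial Code: 381.'},
]
first_pass = add_missing_titles(sample)
second_pass = add_missing_titles(sample)  # nothing left to fix
print(first_pass, second_pass)
```

Because the test is a substring check, a second call changes nothing, so the step is safe to repeat.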

In [27]:
# save the updated documentation file
with open('documentation4.json', 'w') as outfile:
    json.dump(data3, outfile, indent=4)

In [28]:
# open updated documentation file
with open('documentation4.json') as f:
  data4 = json.load(f)

In [29]:
# Confirm that the country name is now present in the 'body'
for state in data4:
  if state['id'] == '6545167':
    print(state)

{'id': '6545167', 'title': 'Poland: SMS Guidelines', 'body': 'Poland: SMS Guidelines. MCC: 260. Dial Code: 48. Alphanumeric Sender IDs are supported and will be maintained, no registration is required.\nThe use of generic Alpha Sender IDs is not recommended. Alpha Senders should be directly related to the message content. \nThere are no restrictions with regards to content towards this destination.\n\nPlease make sure to always refer to our Acceptable Use Policy for Messaging.', 'url': 'https://support.telnyx.com/en/articles/6545167-poland-sms-guidelines'}


### 5) Removing newlines

In [30]:
for state in data4:
  state['body'] = state['body'].replace('.\n\n', '. ').replace('\n', '. ')

In [31]:
data4[0]

{'id': '6683745',
 'title': 'Serbia: SMS Guidelines',
 'body': 'Serbia: SMS Guidelines. MCC: 220. Dial Code: 381. Alphanumeric Sender IDs are supported with registration. Registration is possible to ensure Alphanumeric Senders can be maintained. Without registration, Alpha Senders will be overwritten to Generic Alpha Sender IDs. The registration of Alphanumeric Senders involves monthly fees. Please make sure to always refer to our Acceptable Use Policy for Messaging. For more information on Alpha Sender ID registration kindly reach out to alpha_sender_id@telnyx.com. ',
 'url': 'https://support.telnyx.com/en/articles/6683745-serbia-sms-guidelines'}

In [32]:
data4[1]

{'id': '6683734',
 'title': 'Republic of Korea (South Korea): SMS Guidelines',
 'body': 'Republic of Korea (South Korea): SMS Guidelines. MCC: 450. Dial Code: 82. Alphanumeric Sender IDs are not supported. Registration is not possible. All Alphanumeric Sender IDs will be overwritten to a random Local Long Code  to ensure delivery. All messages to this destination will have the following text added by default:. [Web 발신] : This indicates A2P traffic[국제발신]: This indicates that the message has been sent from abroad. Gambling and Adult content is not permitted. Please make sure to always refer to our Acceptable Use Policy for Messaging.',
 'url': 'https://support.telnyx.com/en/articles/6683734-republic-of-korea-south-korea-sms-guidelines'}
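
Replacing every remaining `\n` with `'. '` can leave doubled punctuation: the South Korea body above contains `default:. [Web`, and a body with `content. \nThere` (as in the Poland record) would become `content. . There`. One possible refinement is sketched below with the standard `re` module. `flatten_newlines` is a hypothetical name, and the set of punctuation characters treated as sentence-ending is an assumption.

```python
import re


def flatten_newlines(text):
    """Replace newlines with sentence breaks without doubling punctuation."""
    # Punctuation followed by newline(s): keep the punctuation, add one space
    text = re.sub(r'([.:;!?,])[ \t]*\n\s*', r'\1 ', text)
    # Any newline left has no punctuation before it: insert a period
    text = re.sub(r'\s*\n\s*', '. ', text)
    return text


# Made-up string combining the cases seen in the records above
raw = 'content. \nThere are none.\n\nPlease refer:\nItem\nEnd'
flat = flatten_newlines(raw)
print(flat)
```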

In [33]:
# save the updated json file
with open('preprocessed.json', 'w') as outfile:
    json.dump(data4, outfile, indent=4)
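
The South Korea record contains Hangul text. By default `json.dump` writes non-ASCII characters as `\uXXXX` escapes, which load back correctly but are hard to read in the file. If readable output is preferred, a sketch along these lines could be used. The helpers `save_json` and `load_json` are hypothetical, and the choice of UTF-8 is an assumption about the downstream consumer.

```python
import json
import os
import tempfile


def save_json(records, path):
    """Write records as indented UTF-8 JSON, keeping non-ASCII text readable."""
    with open(path, 'w', encoding='utf-8') as outfile:
        json.dump(records, outfile, indent=4, ensure_ascii=False)


def load_json(path):
    """Read a JSON file written by save_json."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


# Round trip in a temporary directory so no real file is overwritten
records = [{'id': '6683734', 'body': '[Web 발신] : This indicates A2P traffic'}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'preprocessed.json')
    save_json(records, path)
    with open(path, encoding='utf-8') as f:
        raw = f.read()
    restored = load_json(path)
print(restored == records)
```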

In [34]:
ls

documentation2.json  documentation4.json  preprocessed.json
documentation3.json  documentation.json   sample_data/
