In this tutorial, you will learn how to implement **classes** in Python for storing and parsing the data collected for devices (smartphones, phones, smart watches, etc.) from [GSMArena](https://www.gsmarena.com). I scraped the *devices_data.txt* file required for this tutorial from GSMArena by building a web crawler from scratch using BeautifulSoup. You can learn how to do this in my [previous tutorial](https://www.kaggle.com/vigvisw/collect-data-by-building-a-web-crawler).



**DISCLAIMER:** The credit for originally assembling and making this data available in the public domain goes to the GSMArena team.

The *devices_data.txt* file can be downloaded from my [GitHub page](https://github.com/vigvisw/end2endml).

### Importing Devices Data

In [0]:
# import the required libraries
import json
import re
import numpy as np
import pandas as pd

# if you need to use Colab to import data
# from google.colab import files
# from google.colab import drive

**Options For Working with Data In Colab**

Full list of options and sample code can be found [here](https://colab.research.google.com/notebooks/io.ipynb).

In [0]:
# OPTION 1: upload the data to your virtual machine by calling the upload() method in files
# files.upload()

In [0]:
# OPTION 2: upload the data to your Drive,
# mount it to your virtual machine and access it like a local folder

# drive.mount('drive')

# the argument of the method is the path where the drive should be mounted
# a relative path such as 'drive' is resolved against Colab's working directory, '/content'
# you will have to provide an authorization token for this method to work
# log into your Google account and get the authorization code, paste it here and press Enter

In [0]:
# the console can be accessed in Colab by putting '!' before a command
# !ls

**OPTION 3:** Connecting Colab To Local Runtime

Detailed instructions for connecting Colab to your local Jupyter Notebook runtime can be found [here](https://research.google.com/colaboratory/local-runtimes.html).

Whichever of the methods above you use, once the data is available, read it in using the **json** module.

Resources for learning about *JSON* and I/O in Python:
1. [JSON](https://www.youtube.com/watch?v=pTT7HMqDnJw) by Socratica
2. [Reading and Writing Files](https://www.youtube.com/watch?v=Uh2ebFW8OYM&t=374s) by Corey Schafer


In [0]:
# use the function created in the last tutorial to read in the data
def read_devices_json(file_path):
  '''A function for reading in a JSON file.
     
     Takes in the string file_path and returns a dict
  '''
  with open(file_path, 'r', encoding='utf-8') as file:
    return json.load(file)

In [0]:
# load the file
file_path = '../devices_data.txt'
devices_dict = read_devices_json(file_path)
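
To make the shape of the loaded data concrete, here is a minimal sketch of the same write/read round trip on a tiny, hand-made sample. The sample dict and the file name `sample_devices.txt` are assumptions for illustration only; the nesting (maker, then device number, then device record) mirrors how **devices_dict** is used later in this tutorial.

```python
import json
import os
import tempfile

# Hypothetical miniature of the file: maker -> device_num -> device record.
sample = {
    'Samsung': {
        '3': {
            'device_name': 'Galaxy S10',
            'device_specs': {'Network': {'Technology': 'GSM / CDMA / HSPA / EVDO / LTE'}},
        },
    },
}

# Write the sample to a temporary file, then read it back the same way
# read_devices_json does.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample_devices.txt')
    with open(sample_path, 'w', encoding='utf-8') as file:
        json.dump(sample, file)
    with open(sample_path, 'r', encoding='utf-8') as file:
        loaded = json.load(file)

# JSON object keys always come back as strings, so the device numbers are str.
device_nums = list(loaded['Samsung'].keys())
print(device_nums)
print(loaded['Samsung']['3']['device_name'])
```

Because the device numbers are strings, they can be concatenated directly into an id without calling `str()` first.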

In [0]:
# check that the data has been imported as a dict and have a look at its keys
print(type(devices_dict))
print(devices_dict.keys())

<class 'dict'>
dict_keys(['Acer', 'alcatel', 'Allview', 'Amazon', 'Amoi', 'Apple', 'Archos', 'Asus', 'AT&T;', 'Benefon', 'BenQ', 'BenQ-Siemens', 'Bird', 'BlackBerry', 'Blackview', 'BLU', 'Bosch', 'BQ', 'Casio', 'Cat', 'Celkon', 'Chea', 'Coolpad', 'Dell', 'Emporia', 'Energizer', 'Ericsson', 'Eten', 'Fujitsu Siemens', 'Garmin-Asus', 'Gigabyte', 'Gionee', 'Google', 'Haier', 'Honor', 'HP', 'HTC', 'Huawei', 'i-mate', 'i-mobile', 'Icemobile', 'Infinix', 'Innostream', 'iNQ', 'Intex', 'Jolla', 'Karbonn', 'Kyocera', 'Lava', 'LeEco', 'Lenovo', 'LG', 'Maxon', 'Maxwest', 'Meizu', 'Micromax', 'Microsoft', 'Mitac', 'Mitsubishi', 'Modu', 'Motorola', 'MWg', 'NEC', 'Neonode', 'NIU', 'Nokia', 'Nvidia', 'O2', 'OnePlus', 'Oppo', 'Orange', 'Palm', 'Panasonic', 'Pantech', 'Parla', 'Philips', 'Plum', 'Posh', 'Prestigio', 'QMobile', 'Qtek', 'Razer', 'Realme', 'Sagem', 'Samsung', 'Sendo', 'Sewon', 'Sharp', 'Siemens', 'Sonim', 'Sony', 'Sony Ericsson', 'Spice', 'T-Mobile', 'TECNO', 'Tel.Me.', 'Telit', 'Thuraya',

### Creating Classes To Work With Devices

Since we are working with many related things (devices) to which we will be applying parsing functions in the future, it is a good idea to define a **class** called **Device** for working with the **devices_dict**.

Resources for learning about *Classes* and *Object Oriented Programming*
1. [Written tutorial](https://jeffknupp.com/blog/2014/06/18/improve-your-python-python-classes-and-object-oriented-programming/) by Jeff Knupp
2. [Video tutorial series](https://www.youtube.com/watch?v=ZDa-Z5JzLYM&list=PL-osiE80TeTsqhIuOqKhwlXsIBIdSeYtc) by Corey Schafer
3. [Methods vs Functions](https://www.geeksforgeeks.org/difference-method-function-python/) by GeeksForGeeks

First, we want to give each maker in the GSMArena dataset a *maker_id*.

In [0]:
# create a list of the form [(0, 'Acer'), ...] for all makers in the devices_dict
makers = list(enumerate(devices_dict))
print(makers)

[(0, 'Acer'), (1, 'alcatel'), (2, 'Allview'), (3, 'Amazon'), (4, 'Amoi'), (5, 'Apple'), (6, 'Archos'), (7, 'Asus'), (8, 'AT&T;'), (9, 'Benefon'), (10, 'BenQ'), (11, 'BenQ-Siemens'), (12, 'Bird'), (13, 'BlackBerry'), (14, 'Blackview'), (15, 'BLU'), (16, 'Bosch'), (17, 'BQ'), (18, 'Casio'), (19, 'Cat'), (20, 'Celkon'), (21, 'Chea'), (22, 'Coolpad'), (23, 'Dell'), (24, 'Emporia'), (25, 'Energizer'), (26, 'Ericsson'), (27, 'Eten'), (28, 'Fujitsu Siemens'), (29, 'Garmin-Asus'), (30, 'Gigabyte'), (31, 'Gionee'), (32, 'Google'), (33, 'Haier'), (34, 'Honor'), (35, 'HP'), (36, 'HTC'), (37, 'Huawei'), (38, 'i-mate'), (39, 'i-mobile'), (40, 'Icemobile'), (41, 'Infinix'), (42, 'Innostream'), (43, 'iNQ'), (44, 'Intex'), (45, 'Jolla'), (46, 'Karbonn'), (47, 'Kyocera'), (48, 'Lava'), (49, 'LeEco'), (50, 'Lenovo'), (51, 'LG'), (52, 'Maxon'), (53, 'Maxwest'), (54, 'Meizu'), (55, 'Micromax'), (56, 'Microsoft'), (57, 'Mitac'), (58, 'Mitsubishi'), (59, 'Modu'), (60, 'Motorola'), (61, 'MWg'), (62, 'NEC'), 

Use a smaller subset of the data so that we can follow along with what is happening. I will use the [Samsung Galaxy S10](https://www.gsmarena.com/samsung_galaxy_s10-9536.php). 

To uniquely identify each device, we want to make a **device_id** of the form **maker_name** + '_' + **device_num**, where **device_num** is the key of the device under a maker.

In [0]:
maker_name = 'Samsung'

# get Samsung's maker_id from the list we just created
for maker in makers:
  if maker[1] == maker_name:
    maker_id = maker[0]
    
print(maker_id)

84


In [0]:
# get the data for the Galaxy S10 from the devices_dict
device_name = 'Galaxy S10'
for device_num, device in devices_dict[maker_name].items():
  if device['device_name'] == device_name:
    device_id = maker_name.upper() + '_' + device_num 
    break
    
print(device_id)
print(device)

SAMSUNG_3
{'device_name': 'Galaxy S10', 'device_info': 'Samsung Galaxy S10 Android smartphone. Announced Feb 2019. Features 6.1″ Dynamic AMOLED display, Exynos 9820 Octa chipset, 3400 mAh battery, 512 GB storage, 8 GB RAM, Corning Gorilla Glass 6.', 'device_img_link': 'https://cdn2.gsmarena.com/vv/bigpic/samsung-galaxy-s10.jpg', 'device_link': 'https://www.gsmarena.com/samsung_galaxy_s10-9536.php', 'device_specs': {'Network': {'Technology': 'GSM / CDMA / HSPA / EVDO / LTE', '2G bands': 'GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2 (dual-SIM model only)', 'NaN': 'CDMA 800 / 1900 - USA', '3G bands': 'HSDPA 850 / 900 / 1700(AWS) / 1900 / 2100 - Global, USA', '4G bands': 'LTE band 1(2100), 2(1900), 3(1800), 4(1700/2100), 5(850), 7(2600), 8(900), 12(700), 13(700), 17(700), 18(800), 19(800), 20(800), 25(1900), 26(850), 28(700), 32(1500), 38(2600), 39(1900), 40(2300), 41(2500), 66(1700/2100) - Global', 'Speed': 'HSPA 42.2/5.76 Mbps, LTE-A (7CA) Cat20 2000/150 Mbps', 'GPRS': 'Yes', 'EDGE': 'Yes

At the very least, we will have to use two classes. The first of these will be called **Device**. This class is used to create device objects which contain all the relevant attributes of a device such as its name, link, maker and, most importantly, its specs. Remember that the spec data is text, so we need to parse it into a form which can be manipulated by **pandas** down the line. To help with this process, **Device** needs to <a href="https://en.wikipedia.org/wiki/Inheritance_(object-oriented_programming)">***inherit***</a> from another class, which will be called **FeatureGen**. For now, we will define **FeatureGen** with a bare **pass** statement, and add functionality to it later.

In [0]:
# define a class called FeatureGen which we will use to collect all the features from all the devices
# pass for now, we will add attributes and methods to this class in the next section
class FeatureGen:
  '''FeatureGen will contain a dict of all features from all the devices called all_features_dict
  
     It also contains a collection of useful methods for parsing data from a Devices object
  '''
  pass

In [0]:
# create a basic implementation of a class for working with the devices_data
# much more functionality will be added to this class later
# all Devices will inherit from FeatureGen
class Device(FeatureGen):
  '''A class for working with device data scraped from GSMArena''' 
  # initialize the class using a device 
  def __init__(self, device, device_id, maker_name, maker_id):
    self.maker_name = maker_name
    self.maker_id = maker_id
    self.device_id = device_id
    
    # set the device_info as attributes of the Device 
    for device_info_name, device_info in device.items():
      # emulates the functionality of self.variable = value
      setattr(self, device_info_name, device_info)
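
To see what the `setattr` loop does in isolation, here is a minimal sketch using a hypothetical stand-in class called `Record` (it is not part of this tutorial's code). It only illustrates that attributes created from dict keys behave like ordinary attributes.

```python
class Record:
    '''Hypothetical stand-in for Device: copies dict entries onto the object.'''
    def __init__(self, data):
        for name, value in data.items():
            # equivalent to writing self.<name> = value for each key
            setattr(self, name, value)


record = Record({
    'device_name': 'Galaxy S10',
    'device_link': 'https://www.gsmarena.com/samsung_galaxy_s10-9536.php',
})

# attributes created with setattr can be read with normal dot notation
print(record.device_name)
# vars() returns the same mapping as record.__dict__
print(vars(record))
# getattr with a default avoids an AttributeError for names that were never set
print(getattr(record, 'device_img_link', None))
```

The last call is worth noting: `getattr` with a default is a convenient way to ask whether a given device has a given feature.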

In [0]:
# create an instance of Device using the S10's data and call it device1
device1 = Device(device, device_id, maker_name, maker_id)

In [0]:
# check all the attributes stored under device1
device1.__dict__

{'maker_name': 'Samsung',
 'maker_id': 84,
 'device_id': 'SAMSUNG_3',
 'device_name': 'Galaxy S10',
 'device_info': 'Samsung Galaxy S10 Android smartphone. Announced Feb 2019. Features 6.1″ Dynamic AMOLED display, Exynos 9820 Octa chipset, 3400 mAh battery, 512 GB storage, 8 GB RAM, Corning Gorilla Glass 6.',
 'device_img_link': 'https://cdn2.gsmarena.com/vv/bigpic/samsung-galaxy-s10.jpg',
 'device_link': 'https://www.gsmarena.com/samsung_galaxy_s10-9536.php',
 'device_specs': {'Network': {'Technology': 'GSM / CDMA / HSPA / EVDO / LTE',
   '2G bands': 'GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2 (dual-SIM model only)',
   'NaN': 'CDMA 800 / 1900 - USA',
   '3G bands': 'HSDPA 850 / 900 / 1700(AWS) / 1900 / 2100 - Global, USA',
   '4G bands': 'LTE band 1(2100), 2(1900), 3(1800), 4(1700/2100), 5(850), 7(2600), 8(900), 12(700), 13(700), 17(700), 18(800), 19(800), 20(800), 25(1900), 26(850), 28(700), 32(1500), 38(2600), 39(1900), 40(2300), 41(2500), 66(1700/2100) - Global',
   'Speed': 'HSP

### Expanding the Functionality of Our Classes

The real data we want are the specs of the device, which are stored under **device_specs**, mostly in the form of dictionaries. The one "spec" which is not a dictionary is **Opinions**. We will handle this spec separately.

The general procedure that we want to follow is:

1. Iterate through the key/value pairs in the **device_specs** of a device object and create a new attribute on the device for each one.
2. Name each attribute after the lower-case name of the spec, replacing the ' ', '-' and '–' characters which can appear in a spec name (Selfie Camera, batsize-hl) with '_'. This makes it possible to access the attribute of a device with ordinary dot notation.
3. Use a list called **features_list** to keep track of all the attributes (features or specs) of a **Device**. Use a dict called **all_features_dict** to keep track of all the features collected across all devices. This dict itself is an attribute of **FeatureGen**.
4. Account for any missing values. Although I double checked to make sure that any missing values were parsed as **np.NaN** when building the dataset, for some reason, some of the "sub specs" (Battery and Tests in the above output) parsed as the str 'NaN'. Ensure that the Feature Generator has a function to account for these.
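
The four steps above can be sketched as one standalone function before they are spread across the classes. This is only a sketch under stated assumptions: `normalize`, `is_missing` and `flatten_specs` are hypothetical names, the sample `specs` dict is hand-made (the 'Selfie Camera' entry is invented), and treating any spelling of 'nan' as missing is my simplification.

```python
import re


def normalize(name):
    '''Lower-case a spec name and replace whitespace, '-' and '–' with '_'.'''
    return '_'.join(re.split(r'\s|-|–', name)).lower()


def is_missing(text):
    '''Treat '' and any spelling of 'nan' as missing (an assumption of this sketch).'''
    return str(text).strip().lower() in ('', 'nan')


def flatten_specs(device_specs):
    '''Return (features, notes) built from a nested spec dict.'''
    features, notes = {}, {}
    for spec_name, sub_specs in device_specs.items():
        spec_ = normalize(spec_name)
        if not isinstance(sub_specs, dict):
            # specs that are not dicts are kept unchanged
            features[spec_] = sub_specs
            continue
        for key, value in sub_specs.items():
            if is_missing(value):
                # nothing worth keeping, whatever the key is
                continue
            if is_missing(key):
                # a value without a usable name becomes a note for this spec
                notes.setdefault(spec_, value)
            else:
                features[spec_ + '_' + normalize(key)] = value
    return features, notes


specs = {
    'Network': {'Technology': 'GSM / CDMA / HSPA / EVDO / LTE',
                'NaN': 'CDMA 800 / 1900 - USA'},
    'Selfie Camera': {'Video': 'NaN'},
}
features, notes = flatten_specs(specs)
print(features)
print(notes)
```

The classes below do the same work, except that the results are stored as attributes of each **Device** rather than returned as dicts.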

In [0]:
class FeatureGen:
  '''FeatureGen will contain a dict of all features from all the devices called all_features_dict
  
     It also contains a collection of useful methods for parsing data from a Devices object
  '''
  
  # initialize a collector dictionary to collect features from all devices
  # our dict will have a feature called device_notes for collecting specs where the key is nan
  all_features_dict = {'device_notes':None}
  
  def gen_from_dict(self, spec_value, spec_name):
    '''A function for generating more features from the value of a spec if the spec_value is also a dict'''
    for key, value in spec_value.items():
      # if key or value is np.NaN it will crash split_string, so convert it
      if pd.isna(key):
        key = 'nan'
      if pd.isna(value):
        value = 'nan'
      key_ = self.split_string(key)
      # in some cases the key is NaN or '', and in other cases the value is 'nan' or ''
      # account for these cases
      # both are empty: this is a waste attribute, we don't want to add it to the feature list of a device
      if (key_ == 'nan' or key_ == '') and (value == '' or value == 'nan'):
        pass
      # if the key is 'nan' or '' but the value is not, we want to create a note about the value under device_notes
      elif key_ == 'nan' or key_ == '':
        # the note is stored under the spec name, e.g. {'features': '- Organizer- Voice memo'}
        self.device_notes.setdefault(spec_name, value)
      # if the key is not empty and the value is, we do not want this spec
      elif value == '' or value == 'nan':
        pass
      # if none of the above issues are there, we can add the feature as an attribute of the device
      else:
        new_key = spec_name + '_' + key_
        setattr(self, new_key, value)
        self.set_all_features(new_key)
        self.create_feature(new_key)

In [0]:
class Device(FeatureGen):
  '''A class for working with device data scraped from GSMArena''' 
  
  # initialize the class using a device 
  def __init__(self, device, device_id, maker_name, maker_id):
    # keep track of all the features collected for THIS device
    # (a list defined in the class body would be shared by every Device)
    self.features_list = []
    
    self.maker_name = maker_name
    self.create_feature('maker_name')
    self.set_all_features('maker_name')
    
    self.maker_id = maker_id
    self.create_feature('maker_id')
    self.set_all_features('maker_id')
    
    self.device_id = device_id
    self.create_feature('device_id')
    self.set_all_features('device_id')
    
    # set the device_info as attributes of the Device 
    for device_info_name, device_info in device.items():
      # emulates the functionality of self.variable = value
      setattr(self, device_info_name, device_info)
      self.create_feature(device_info_name)
      self.set_all_features(device_info_name)
      
    # go through each spec and parse it if needed before adding to the devices attributes
    self.device_notes = {}
    for spec, value in device['device_specs'].items():
      spec_ = self.split_string(spec)
      if isinstance(value, dict):
        self.gen_from_dict(value, spec_)
        self.create_feature(spec_)
      else:
        setattr(self, spec_, value)
      
      
  # helping functions    
  def split_string(self, spec_name):
    '''A function for changing the ' ', '-' and '–' delimiters
       in a spec_name to '_'
       
       Given 'Selfie Camera', returns selfie_camera
    '''
    split_spec_pattern = re.compile(r'\s|-|–')
    split_specs = re.split(split_spec_pattern, spec_name)
    return '_'.join(split_specs).lower()
  
  
  def create_feature(self, spec_name):
    '''A function that allows us to consolidate the names of all features recovered from the GIVEN device'''
    if spec_name not in self.features_list:
      self.features_list.append(spec_name)
      
      
  def set_all_features(self, spec_name):
    '''A function that allows us to consolidate the names of all features recovered from ALL devices'''
    if spec_name not in FeatureGen.all_features_dict:
      FeatureGen.all_features_dict.setdefault(spec_name, None)
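
One Python detail matters for tracking lists such as **features_list**: where the list is created decides whether it belongs to one object or to all of them. The sketch below uses two hypothetical classes, `SharedTracker` and `OwnTracker`, to show the difference. A shared container is what we want for **all_features_dict**, while a per-device list has to be created inside `__init__`.

```python
class SharedTracker:
    # defined in the class body: one list object shared by every instance
    features_list = []

    def add(self, name):
        self.features_list.append(name)


class OwnTracker:
    def __init__(self):
        # defined in __init__: a fresh list for each instance
        self.features_list = []

    def add(self, name):
        self.features_list.append(name)


a, b = SharedTracker(), SharedTracker()
a.add('maker_name')
print(b.features_list)   # b sees the entry that was added through a

c, d = OwnTracker(), OwnTracker()
c.add('maker_name')
print(d.features_list)   # d is unaffected
```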

In [0]:
device1 = Device(device, device_id, maker_name, maker_id)
device1.__dict__

### Create the Skeleton Dataset

Since we have a class which can extract all the specs (read as *features*) from the spec sheet of a device, we can build the first draft of the dataset. 

We will build up to a function called **create_df**. The first step is a helper, **create_devices_from_data**, which creates **Device** objects out of all the devices in **devices_dict**. Since the features from all devices are being added to the dictionary **all_features_dict** located inside the class **FeatureGen**, we will use the keys of this dictionary as the columns.

In [0]:
def create_devices_from_data(devices_dict):
  '''A function for creating objects out of all the devices stored in devices_dict
  
     Takes in dict and returns a list with all the device objects created from the data  
  '''
  # create an empty list for collecting all the device objects
  devices_collector = []
  # each maker has a maker_id starting from 0
  for maker_id, (maker_name, devices_info) in enumerate(devices_dict.items()):
    maker_name_split = [split.upper() for split in re.split(re.compile(' |-'), maker_name)]
    maker_name_ = ''.join(maker_name_split)
    
    # iterate through each device under a maker and use maker_name_ and device_num to create a unique device_id
    for device_num, device in devices_info.items():
      device_id = maker_name_ + '_' + device_num
      # create the device object using the Device class and then append the object to the collector array 
      device_ = Device(device, device_id, maker_name, maker_id)
      devices_collector.append(device_)
  
  return devices_collector

In [0]:
FeatureGen.all_features_dict.keys()

dict_keys(['device_notes', 'maker_name', 'maker_id', 'device_id', 'device_name', 'device_info', 'device_img_link', 'device_link', 'device_specs', 'network_technology', 'network_2g_bands', 'network_3g_bands', 'network_4g_bands', 'network_speed', 'network_gprs', 'network_edge', 'launch_announced', 'launch_status', 'body_dimensions', 'body_weight', 'body_build', 'body_sim', 'display_type', 'display_size', 'display_resolution', 'display_protection', 'platform_os', 'platform_chipset', 'platform_cpu', 'platform_gpu', 'memory_card_slot', 'memory_internal', 'main_camera_triple', 'main_camera_features', 'main_camera_video', 'selfie_camera_single', 'selfie_camera_features', 'selfie_camera_video', 'sound_loudspeaker', 'sound_3.5mm_jack', 'comms_wlan', 'comms_bluetooth', 'comms_gps', 'comms_nfc', 'comms_radio', 'comms_usb', 'features_sensors', 'battery_charging', 'misc_colors', 'misc_price', 'tests_performance', 'tests_display', 'tests_camera', 'tests_loudspeaker', 'tests_audio_quality', 'tests_ba

In [0]:
# test it out
devices_collector = create_devices_from_data(devices_dict)

print(len(devices_collector))
print(devices_collector[0])

9536
<__main__.Device object at 0x74af7fef3860>


As a next step, we want to create a function which takes in a feature name from the keys of **FeatureGen.all_features_dict** and checks whether this feature is present as an attribute of each device object in **devices_collector**, returning a list of the values for that feature column. Each key present in **all_features_dict** will be a feature of the final feature table.

If the device has a given attribute, then its value is added to the list; otherwise we add **np.nan** to the list to denote a missing value. Finally, this list will be set as the value of the feature (i.e. the key).

In [0]:
def create_feature_column(feature_name, devices_collector):
  '''A function for creating a feature column of a given name using devices_collector'''  
  collector_array = []

  for device in devices_collector:
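    feature = getattr(device, feature_name, None)
    # note: this check treats every falsy value (None, 0, '', {}) as missing
    if not feature:
      # np.nan is used because the np.NaN alias was removed in NumPy 2.0
      collector_array.append(np.nan)
    else:
      collector_array.append(feature)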
  return collector_array

In [0]:
# test it out
feature_name = 'banner_batsize_hl'
feature_column = create_feature_column(feature_name, devices_collector)
print(feature_column)

['4500', '3400', '4080', '2000', '6100', '4020', '3000', '5000', '2000', '4420', '2870', '2000', '2000', '4000', '4000', '2420', '2420', '2000', '2000', '5910', '4550', '5910', '5910', '2300', '2000', '1300', '1300', '2000', '2300', '2000', '2700', '2100', '3500', '2500', '1300', '4600', '3400', '3400', '3700', '2000', '2500', '1630', '2955', '2955', '4000', '2000', '3300', '1500', '2400', '7300', '4960', '4960', '2000', '1300', '2000', '1760', '2640', '2710', '3420', '1300', '1500', '1500', '1300', '1460', '3260', '3260', '9800', '9800', '9800', '9800', '1300', nan, '3260', '1500', '3260', '1530', '1530', '1300', '1300', '1300', '1500', '1500', '1500', '1500', '1400', '1350', '1090', '1090', '970', '1500', '1350', '1350', '1140', '1140', '1140', '1260', '1530', '1530', '1530', '1530', '4080', '3500', '3500', '3060', '3000', '2460', '2050', '4000', '4000', '3000', '3000', '3000', '3000', '2460', '3000', '4000', '2580', '2850', '2800', '4000', '4000', '2050', '2800', '2620', '3100', '20

With all of these pieces in place, we can finally define a function called **create_df** which takes the **devices_dict** from earlier as its input and returns a *Pandas DataFrame* with all the features extracted from all the devices.

In [0]:
# create a function that uses the above functions and the class to create a data table for devices_data
def create_df(devices_dict):
  # create Device objects from the devices_dict
  devices_collector = create_devices_from_data(devices_dict)
  
  # get a dict of all features collected across all devices
  all_features_dict = FeatureGen.all_features_dict
  
  # create the feature columns for each feature and set it as the new value of the all_features_dict
  for feature_name, _ in all_features_dict.items():
    col = create_feature_column(feature_name, devices_collector)
    all_features_dict[feature_name] = col
  
  # create a DataFrame from the dict and return it
  return pd.DataFrame(all_features_dict)

In [0]:
# test it out
df = create_df(devices_dict)

In [0]:
# take a peek at the dataset
df.head()

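**Note** that the **maker_id** *0* shows up as **NaN**. Pandas is not responsible for this: **create_feature_column** uses `if not feature`, which treats any falsy value as missing, and *0* is falsy, so every device of the first maker (Acer) loses its id. Having NaN in the column also forces it to a float dtype, which is why the remaining ids are displayed as floats such as *113.0*. We can also get the **tail** of the **DataFrame** to double check that the rest of the **maker_id**'s are unaffected.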

In [0]:
# get the first row of the df and verify that maker_id is nan
row1 = df.iloc[0, :]
pd.isna(row1['maker_id'])

True
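
If you would rather keep *0* as a valid id, one option is to test for `None` instead of truthiness. The sketch below is a hypothetical variant, not the function used for the outputs in this tutorial: `FakeDevice` and `create_feature_column_strict` are made-up names, and `math.nan` stands in for `np.nan` so that the snippet runs without NumPy.

```python
import math


class FakeDevice:
    '''Hypothetical stand-in for a Device object.'''
    def __init__(self, **attrs):
        for name, value in attrs.items():
            setattr(self, name, value)


def create_feature_column_strict(feature_name, devices):
    '''Variant of create_feature_column that only treats absent attributes as missing.'''
    column = []
    for device in devices:
        feature = getattr(device, feature_name, None)
        # `is None` keeps legitimate falsy values such as 0
        column.append(math.nan if feature is None else feature)
    return column


devices = [FakeDevice(maker_id=0), FakeDevice(maker_id=84), FakeDevice()]
column = create_feature_column_strict('maker_id', devices)
print(column)
```

The trade-off is that empty strings and empty dicts are then kept as values too, so they would need to be cleaned in a later step.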

In [0]:
# get the tail of the df to verify that other makers have their id unaltered
df.tail()

Unnamed: 0,device_notes,maker_name,maker_id,device_id,device_name,device_info,device_img_link,device_link,device_specs,network_technology,...,features_messaging,features_games,features_java,sound_alert_types,features_clock,features_alarm,features_languages,main_camera_quad,selfie_camera_triple,main_camera_five
9531,"{'features': '- Organizer- Voice memo', 'batte...",ZTE,113.0,ZTE_245,F600,ZTE F600 phone. Announced 2009. Features 2.4″...,https://cdn2.gsmarena.com/vv/bigpic/zte-f600.jpg,https://www.gsmarena.com/zte_f600-3102.php,"{'Network': {'Technology': 'GSM / UMTS', '2G b...",GSM / UMTS,...,"SMS, MMS, Email",Yes + downloadable,"Yes, MIDP 2.0",,,,,,,
9532,"{'features': '- Organizer- Voice memo', 'batte...",ZTE,113.0,ZTE_246,F103,ZTE F103 phone. Announced 2009. Features 2.0″...,https://cdn2.gsmarena.com/vv/bigpic/zte-f103.jpg,https://www.gsmarena.com/zte_f103-3099.php,"{'Network': {'Technology': 'GSM / UMTS', '2G b...",GSM / UMTS,...,"SMS, MMS, Email",Yes + downloadable,"Yes, MIDP 2.0",,,,,,,
9533,"{'features': '- Organizer- Voice memo', 'batte...",ZTE,113.0,ZTE_247,F101,ZTE F101 phone. Announced 2009. Features 2.0″...,https://cdn2.gsmarena.com/vv/bigpic/zte-f101.jpg,https://www.gsmarena.com/zte_f101-3101.php,"{'Network': {'Technology': 'GSM / UMTS', '2G b...",GSM / UMTS,...,"SMS, MMS, Email",Yes + downloadable,"Yes, MIDP 2.0",,,,,,,
9534,"{'features': '- Organizer- Voice memo', 'batte...",ZTE,113.0,ZTE_248,F100,ZTE F100 phone. Announced 2009. Features 2.0″...,https://cdn2.gsmarena.com/vv/bigpic/zte-f100.jpg,https://www.gsmarena.com/zte_f100-3100.php,"{'Network': {'Technology': 'GSM / UMTS', '2G b...",GSM / UMTS,...,"SMS, MMS, Email",Yes + downloadable,"Yes, MIDP 2.0",,,,,,,
9535,"{'network': 'GSM 850 / 1900', 'camera': 'No', ...",ZTE,113.0,ZTE_249,Coral200 Sollar,ZTE Coral200 Sollar phone. Announced May 2007....,https://cdn2.gsmarena.com/vv/bigpic/zte-coral2...,https://www.gsmarena.com/zte_coral200_sollar-3...,"{'Network': {'Technology': 'GSM', '2G bands': ...",GSM,...,SMS,Yes,No,,,,,,,


In [0]:
# take a look at the shape of the dataset and the columns
print(df.shape)
print(df.info())

(9536, 93)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9536 entries, 0 to 9535
Data columns (total 93 columns):
device_notes                9535 non-null object
maker_name                  9536 non-null object
maker_id                    9436 non-null float64
device_id                   9536 non-null object
device_name                 9536 non-null object
device_info                 9536 non-null object
device_img_link             9536 non-null object
device_link                 9536 non-null object
device_specs                9536 non-null object
network_technology          9536 non-null object
network_2g_bands            9521 non-null object
network_3g_bands            5728 non-null object
network_4g_bands            2392 non-null object
network_speed               5755 non-null object
network_gprs                9504 non-null object
network_edge                9513 non-null object
launch_announced            9520 non-null object
launch_status               9536 non-null object

### Parsing Data

In the previous step, we managed to build what I call a "skeleton" version of the dataset. The name implies that it has all of the bare-bones components necessary for exploratory data analysis, which will be the topic of the next tutorial. However, we need to parse the data points first. All the data are currently strings, and we need to write functions to extract the data and convert it to the types that we want.

To help do this, we will use a new class called **ParsingFunctions**, which will be inherited by **FeatureGen**.

In [0]:
# create a class for housing the parsing functions

class ParsingFunctions:
  '''A class for housing all the parsing function which will be used on device specs data'''
  pass

If we take a look at the data under any device, we can see that the **banner_SPEC_hl** features contain information repeated from the other specs. The reason I chose to extract these features is that they make the parsing of certain types of data easier. A prime example of this is the feature **banner_batsize_hl**, which tells us the [capacity](http://web.mit.edu/evt/summary_battery_specifications.pdf) of the phone's battery in *mAh* without the need to parse any extra string. While writing regular expressions can be fun, doing it for this many features is a very time-consuming process and we should take any help we can get. By using the banner feature, we simply need to convert the string into an **np.float64** object in order to make it usable for statistical analysis and data visualization.

Since we will be defining parsing functions for a large number of features, I will not be going through each one in detail. I will show you how you can define your own parsing functions for two features using **ParsingFunctions**, and I **highly suggest** that you use the comments and the functions' docstrings to understand how the rest work.

In [0]:
class ParsingFunctions:
  '''A class for housing all the parsing function which will be used on device specs data'''
  
  # all specs passed into FeatureGen will also be passed into ParsingFunctions
  # we will create a list which contains the name of the spec_values which we want to parse
  # the feature will only be parsed if the name is on the list
  allow_parsing = ['banner_batsize_hl', 'banner_displayres_hl']
  
  
  def parse_spec(self, spec_, value):
    '''A function for parsing each of the spec to get the information we want
       This function is meant to be called on each iteration of banner_spec/value pair
    '''
    # check if spec_ is in the allow_parsing list
    if spec_ in ParsingFunctions.allow_parsing:
      # parsing function for a feature MUST be stored as a function of the name, parse_feature_name
      # for example, parse_banner_batsize_hl is the parsing function for the feature banner_batsize_hl
      parsing_function_name = 'parse_' + spec_
      parsing_function = getattr(self, parsing_function_name)
      parsed_values = parsing_function(value)
      # parsing functions will be written in such a way that the features and values are part of a dict
      for feature_name, feature_value in parsed_values.items():
        setattr(self, feature_name, feature_value)
        self.set_all_features(feature_name)
        self.create_feature(feature_name)
    # if spec_ is not in the allow_parsing list, simply set the current value of the spec
    else:
      setattr(self, spec_, value)
      self.set_all_features(spec_)
      self.create_feature(spec_)
        
        
        
  def parse_banner_batsize_hl(self, value):
    '''A function for parsing the batsize_hl

       Takes in str '3800' and returns a dict holding the np.float64 3800.0 under 'batsize'
    '''
    return {'batsize':np.float64(value)}
  
  
  def parse_banner_displayres_hl(self, value):
    '''A function for parsing the displayres_hl
    
       Takes in str '1280x1920 pixels' and returns a dict with two np.float64's for the length and height of the screen
    '''
    res_pattern = re.compile(r'(\d+)x(\d+)')
    return_val =  re.findall(res_pattern, value)
    feature_names = ['displayres_len', 'displayres_height']
    if not return_val:
      return {'displayres_len':np.nan, 'displayres_height':np.nan}
    else:
      return  {name:np.float64(i) for name, i  in zip(feature_names, return_val[0])}
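
To try the naming convention on a feature of your own, here is a trimmed-down sketch of the same dispatch idea. Everything in it is hypothetical: the class `ParserSketch`, the output name `weight_g`, and above all the assumption that **body_weight** strings look like '157 g (5.54 oz)'. Check the real values in your data before relying on the pattern.

```python
import re


class ParserSketch:
    '''Hypothetical, trimmed-down version of the dispatch idea in ParsingFunctions.'''
    allow_parsing = ['body_weight']

    def parse_spec(self, spec_, value):
        if spec_ in self.allow_parsing:
            # look up the method named parse_<spec_> and call it
            parsed = getattr(self, 'parse_' + spec_)(value)
        else:
            # specs without a parser keep their raw value
            parsed = {spec_: value}
        for name, parsed_value in parsed.items():
            setattr(self, name, parsed_value)

    def parse_body_weight(self, value):
        '''Assumes strings shaped like '157 g (5.54 oz)'; returns grams as a float.'''
        match = re.search(r'(\d+(?:\.\d+)?)\s*g\b', value)
        return {'weight_g': float(match.group(1)) if match else float('nan')}


sketch = ParserSketch()
sketch.parse_spec('body_weight', '157 g (5.54 oz)')
sketch.parse_spec('body_build', 'Glass front')
print(sketch.weight_g)
print(sketch.body_build)
```

Because the parser is found by name with `getattr`, adding support for a new feature only takes two steps: write a `parse_<feature_name>` method and add the feature name to `allow_parsing`.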

In [0]:
class FeatureGen(ParsingFunctions):
  '''FeatureGen will contain a dict of all features from all the devices called all_features_dict
  
     It also contains a collection of useful methods for parsing data from a Devices object
  '''
  
  # initialize a collector dictionary to collect features from all devices
  # our dict will have a feature called device_notes for collecting specs where the key is nan
  all_features_dict = {'device_notes':None}
  
  def gen_from_dict(self, spec_value, spec_name):
    '''A function for generating more features from the value of a spec if the spec_value is also a dict'''
    for key, value in spec_value.items():
      if pd.isna(key):
        key = 'nan'
      if pd.isna(value):
        value = 'nan'
      key_ = self.split_string(key)
      # in some cases the key is NaN or '', and in other cases the value is 'nan' or ''
      # account for these cases
      # both are empty: this is a waste attribute, we don't want to add it to the feature list of a device
      if (key_ == 'nan' or key_ == '') and (value == '' or value == 'nan'):
        pass
      # if the key is 'nan' or '' but the value is not, we want to create a note about the value under device_notes
      elif key_ == 'nan' or key_ == '':
        # the note is stored under the spec name, e.g. {'features': '- Organizer- Voice memo'}
        self.device_notes.setdefault(spec_name, value)
      # if the key is not empty and the value is, we do not want this spec
      elif value == '' or value == 'nan':
        pass
      # if none of the above issues are there, we can add the feature as an attribute of the device
      else:
        new_key = spec_name + '_' + key_
        self.parse_spec(new_key, value)
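
The branching in **gen_from_dict** boils down to three outcomes per key/value pair: skip it, store it as a note, or turn it into a feature. The sketch below restates that decision as a standalone function so it can be tested without a **Device** object. The name `classify_sub_spec` and the labels it returns are hypothetical and are not part of the tutorial's classes.

```python
import math


def classify_sub_spec(key, value):
    """Return 'skip', 'note' or 'feature' for one sub-spec key/value pair."""

    def is_blank(item):
        # Treat None, float NaN, '' and the literal string 'nan' as missing.
        if item is None:
            return True
        if isinstance(item, float) and math.isnan(item):
            return True
        return str(item).strip().lower() in ('', 'nan')

    if is_blank(value):
        return 'skip'      # nothing useful to store, whatever the key is
    if is_blank(key):
        return 'note'      # value without a name: keep it as a device note
    return 'feature'       # named value: becomes an attribute of the device


print(classify_sub_spec('Technology', 'GSM / HSPA / LTE'))
print(classify_sub_spec('NaN', 'CDMA 800 / 1900 - USA'))
print(classify_sub_spec('GPRS', ''))
```

Checking the value first makes the two "skip" cases collapse into one, which leaves less room for mistakes in the boolean conditions.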

In [0]:
class Device(FeatureGen):
  '''A class for working with device data scraped from GSMArena'''

  # initialize the class using a device
  def __init__(self, device, device_id, maker_name, maker_id):
    # keep track of all the features collected for THIS device
    # (an instance attribute; a class-level list would be shared by every Device)
    self.features_list = []

    self.maker_name = maker_name
    self.create_feature('maker_name')
    self.set_all_features('maker_name')
    
    self.maker_id = maker_id
    self.create_feature('maker_id')
    self.set_all_features('maker_id')
    
    self.device_id = device_id
    self.create_feature('device_id')
    self.set_all_features('device_id')
    
    # set the device_info as attributes of the Device 
    for device_info_name, device_info in device.items():
      # emulates the functionality of self.variable = value
      setattr(self, device_info_name, device_info)
      self.create_feature(device_info_name)
      self.set_all_features(device_info_name)
    
      
    self.device_notes = {}
    for spec, value in device['device_specs'].items():
      spec_ = self.split_string(spec)
#       print(spec_ , ':  ',value)
      if isinstance(value, dict):
        self.gen_from_dict(value, spec_)
        self.create_feature(spec_)
      else:
        setattr(self, spec_, value)
      
      
  def split_string(self, spec_name):
    '''A function for changing the ' ' and '-' delimiters
       in a spec_name to '_'
       
       Given 'Selfie Camera', returns selfie_camera
    '''
    # raw string, so that \s is passed to the regex engine unchanged
    split_spec_pattern = re.compile(r'\s|-|–')
    split_specs = re.split(split_spec_pattern, spec_name)
    return '_'.join(split_specs).lower()
  
  
  def create_feature(self, spec_name):
    '''A function that allows us to consolidate the names of all features recovered from the GIVEN device'''
    if spec_name not in self.features_list:
      self.features_list.append(spec_name)
      
      
  def set_all_features(self, spec_name):
    '''A function that allows us to consolidate the names of all features recovered from ALL devices'''
    if spec_name not in FeatureGen.all_features_dict:
      FeatureGen.all_features_dict.setdefault(spec_name, None)
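
The renaming done by **split_string** can be tried out on its own. The sketch below is a standalone version of the same idea; the function name `to_feature_name` is hypothetical, and the inputs are only illustrative.

```python
import re

# Raw string so the backslash in \s reaches the regex engine unchanged.
DELIMITERS = re.compile(r'\s|-|–')


def to_feature_name(spec_name):
    """Lower-case a spec name and replace spaces, hyphens and en dashes with '_'."""
    return '_'.join(DELIMITERS.split(spec_name)).lower()


print(to_feature_name('Selfie Camera'))
print(to_feature_name('Stand-by'))
print('network' + '_' + to_feature_name('2G bands'))
```

Note that every delimiter character is replaced individually, so a name with a run of delimiters such as 'a - b' yields several consecutive underscores.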

Repeat the same process we saw earlier to create a new **Device** object.

In [0]:
maker_name = 'Samsung'

# get Samsung's maker_id from the list we just created
for maker in makers:
  if maker[1] == maker_name:
    maker_id = maker[0]
    
print(maker_id)

# get the data for Galaxy S10 from the devices_dict 
device_name = 'Galaxy S10'
for device_num, device in devices_dict[maker_name].items():
  if device['device_name'] == device_name:
    device_id = maker_name.upper() + '_' + device_num 
    break
    
print(device_id)
print(device)

84
SAMSUNG_3
{'device_name': 'Galaxy S10', 'device_info': 'Samsung Galaxy S10 Android smartphone. Announced Feb 2019. Features 6.1″ Dynamic AMOLED display, Exynos 9820 Octa chipset, 3400 mAh battery, 512 GB storage, 8 GB RAM, Corning Gorilla Glass 6.', 'device_img_link': 'https://cdn2.gsmarena.com/vv/bigpic/samsung-galaxy-s10.jpg', 'device_link': 'https://www.gsmarena.com/samsung_galaxy_s10-9536.php', 'device_specs': {'Network': {'Technology': 'GSM / CDMA / HSPA / EVDO / LTE', '2G bands': 'GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2 (dual-SIM model only)', 'NaN': 'CDMA 800 / 1900 - USA', '3G bands': 'HSDPA 850 / 900 / 1700(AWS) / 1900 / 2100 - Global, USA', '4G bands': 'LTE band 1(2100), 2(1900), 3(1800), 4(1700/2100), 5(850), 7(2600), 8(900), 12(700), 13(700), 17(700), 18(800), 19(800), 20(800), 25(1900), 26(850), 28(700), 32(1500), 38(2600), 39(1900), 40(2300), 41(2500), 66(1700/2100) - Global', 'Speed': 'HSPA 42.2/5.76 Mbps, LTE-A (7CA) Cat20 2000/150 Mbps', 'GPRS': 'Yes', 'EDGE': '

In [0]:
device1 = Device(device, device_id, maker_name, maker_id)

In [0]:
# check that the data was parsed properly
type(device1.batsize)
type(device1.displayres_height)
type(device1.displayres_len)

print('device_name: {} | batsize: {}, {} | displayres_height: {}, {} | displayres_len: {}, {}'
      .format(device1.device_name, device1.batsize, type(device1.batsize)
             , device1.displayres_height, type(device1.displayres_height)
             , device1.displayres_len, type(device1.displayres_len)
             ) 
     )

device_name: Galaxy S10 | batsize: 3400.0, <class 'numpy.float64'> | displayres_height: 3040.0, <class 'numpy.float64'> | displayres_len: 1440.0, <class 'numpy.float64'>


### Creating Your Own Parsing Functions

Since each person using this dataset will have their own set of questions to answer, it would be best to modify the **ParsingFunctions** class so that it takes in a list of user-defined parsing functions, which are then applied to the appropriate column to get the features that we want.

To achieve this, we create a new *classmethod* called **add_new_parsers**, which can take in either a single function or a list of functions. To parse a device attribute when the device is created, we first need to initialize **ParsingFunctions** by calling the **add_new_parsers** method as shown in the code below.

**NOTE:** The two parsing functions we defined earlier are no longer defined in the **ParsingFunctions** class. The fact that they are user-defined offers us a lot of flexibility. The features to be parsed in the **allow_parsing** list are also populated by the **add_new_parsers** method.
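
The naming convention that the registration step relies on can be sketched in isolation: the spec name is whatever follows the `parse_` prefix in the function's name. The helper `column_for_parser` below is hypothetical. Unlike the class in this tutorial, it also rejects names that do not start with `parse_`, which is an added assumption rather than existing behaviour.

```python
PREFIX = 'parse_'


def column_for_parser(parser):
    """Return the spec name a parser targets, e.g. parse_body_weight -> body_weight."""
    if not callable(parser):
        raise TypeError('parser must be callable, not {}'.format(type(parser)))
    name = parser.__name__
    if not name.startswith(PREFIX) or name == PREFIX:
        raise ValueError('{!r} does not follow the parse_<spec_name> format'.format(name))
    return name[len(PREFIX):]


def parse_body_weight(value):
    # Placeholder parser used only to demonstrate the name lookup.
    return {'weight': value}


print(column_for_parser(parse_body_weight))
```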

In [0]:
class ParsingFunctions:
  '''A class for housing all the parsing function which will be used on device specs data'''
  
  # a list to keep track of which features have parsing functions
  allow_parsing = []
  
  def parse_spec(self, spec_, value):
    '''A function for parsing each of the spec to get the information we want
       This function is meant to be called on each iteration of banner_spec/value pair
    '''
    # check if spec_ is in the allow_parsing list
    if spec_ in ParsingFunctions.allow_parsing:
      # parsing function for a feature MUST be stored as a function of the name, parse_feature_name
      # for example, parse_banner_batsize_hl is the parsing function for the feature banner_batsize_hl
      parsing_function_name = 'parse_' + spec_
      parsing_function = getattr(ParsingFunctions, parsing_function_name)
      parsed_values = parsing_function(value)
      # parsing functions will be written in such a way that the features and values are part of a dict
      for feature_name, feature_value in parsed_values.items():
        setattr(self, feature_name, feature_value)
        self.set_all_features(feature_name)
        self.create_feature(feature_name)
    # if spec_ is not in the allow_parsing list, simply set the current value of the spec
    else:
      setattr(self, spec_, value)
      self.set_all_features(spec_)
      self.create_feature(spec_)
      
  @classmethod   
  def add_new_parsers(cls, new_parsers):
    '''A function for creating user defined parsers, initialize new parsers before calling Device
    
       Accepts a single function or a list of functions   
    '''
    # if a list of parsers is provided, check if each parser is a function that follows our name format
    if isinstance(new_parsers, list):
      for n, parser in enumerate(new_parsers):
        if not callable(parser):
          raise TypeError('parser at index {} in new_parsers must be callable, not of type {}'.format(n, type(parser)))
        else:
          parsing_function_name = parser.__name__
          # slice the name, say parse_batsize_hl at parse_  i.e [6:]
          col_name = parsing_function_name[6:]
          if col_name not in cls.allow_parsing:
            cls.allow_parsing.append(col_name)
            setattr(cls, parsing_function_name, parser)
          else:
            print('WARNING! function for parsing {} already exists! {} was not added'.format(col_name, parsing_function_name))
    elif callable(new_parsers):
      parsing_function_name = new_parsers.__name__
      col_name = parsing_function_name[6:]
      if col_name not in cls.allow_parsing:
        cls.allow_parsing.append(col_name)
        setattr(cls, parsing_function_name, new_parsers)
      else:
        print('WARNING! function for parsing {} already exists! {} was not added'.format(col_name, parsing_function_name))
    else:
      raise TypeError('parser must be callable, not of type {}'.format(type(new_parsers)))

In [0]:
def parse_banner_batsize_hl(value):
  '''A function for parsing the batsize_hl

     Takes in str '3800' and returns the np.float64 3800
  '''
  return {'batsize':np.float64(value)}

  
def parse_banner_displayres_hl(value):
  '''A function for parsing the displayres_hl

     Takes in str '1280x1920 pixels' and returns two np.float64's for the length and height of the screen
  '''
  # raw string, so that \d is passed to the regex engine unchanged
  res_pattern = re.compile(r'(\d+)x(\d+)')
  return_val = re.findall(res_pattern, value)
  feature_names = ['displayres_len', 'displayres_height']
  if not return_val:
    return {'displayres_len': np.nan, 'displayres_height': np.nan}
  else:
    return {name: np.float64(i) for name, i in zip(feature_names, return_val[0])}

In [0]:
# add the user defined parsing function to the ParsingFunctions class
print(ParsingFunctions.allow_parsing)
parsing_functions = [parse_banner_batsize_hl, parse_banner_displayres_hl]
ParsingFunctions.add_new_parsers(parsing_functions)
print(ParsingFunctions.allow_parsing)

[]
['banner_batsize_hl', 'banner_displayres_hl']


In [0]:
# create a test Device object and test it out
device1 = Device(device, device_id, maker_name, maker_id)
device1.displayres_height

3040.0


Armed with the [data](https://github.com/vigvisw/end2endml) and the knowledge that you have learned here, you can write a parsing function for any spec in the *devices_data.txt*, as long as you follow the rules below.

**Rules For Writing Parsing Functions For Devices Data**

1. Currently, parsing functions can only be defined for **specs**, which includes everything we collected from a device's page. I have kept it this way because these are the things I consider to be a device's features. You can modify the three classes to parse device information such as **device_name**, **maker_name**, etc., if you choose.
2. The parsing function **must** follow the name format 'parse_' + spec_name. For example, if you are trying to parse the feature **platform_os** as seen in the columns of the *Skeleton Dataset*, the parsing function which you define must be named **parse_platform_os**.
3. The return value of the parsing function **must** be a dictionary of the format **{*new_feature_name*:parsed_value,.........}**. Each key (*new_feature_name*) will be used to create a new parsed feature column. **Note** that the **new_feature_name**(s) in the above dictionary will replace the input feature in the **all_features_dict**.
4. Each parsing function that you want to use must first be defined and then passed as a list (or as the function object itself, if using only one) to **ParsingFunctions.add_new_parsers()** as an argument.
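
As an illustration of rules 2 and 3, here is a minimal sketch of a parser for **platform_os**. The output feature names `os_name` and `os_version`, the regular expression, and the sample strings are all assumptions made for this example. It uses `math.nan` and plain floats instead of NumPy types so that it runs on its own.

```python
import math
import re

# One or more space-separated words, optionally followed by a version number.
OS_PATTERN = re.compile(r'\s*([A-Za-z]+(?: [A-Za-z]+)*)\s*(\d+(?:\.\d+)?)?')


def parse_platform_os(value):
    """Split an OS description into a name and a numeric version (NaN if absent)."""
    match = OS_PATTERN.match(value)
    if match is None:
        return {'os_name': math.nan, 'os_version': math.nan}
    name, version = match.groups()
    return {'os_name': name, 'os_version': float(version) if version else math.nan}


print(parse_platform_os('Android 9.0 (Pie); One UI'))
print(parse_platform_os('Microsoft Windows Mobile 6.1 Professional'))
```

Because the function is named `parse_platform_os`, registering it would target the **platform_os** spec, and the two keys it returns would become the new feature columns.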

### Putting It All Together

Everything we have made so far can be put together and implemented as a module called **deviceparser**.

The full source code is copied below, and the module, called *deviceparser.py*, can be cloned or downloaded from my [GitHub page](https://github.com/vigvisw/devicedataparser).

In [0]:
import json
import re 
import numpy as np
import pandas as pd

In [0]:
class ParsingFunctions:
  '''A class for housing all the parsing function which will be used on device specs data'''
  
  # a list to keep track of which features have parsing functions
  allow_parsing = []
  
  def parse_spec(self, spec_, value):
    '''A function for parsing each of the spec to get the information we want
       This function is meant to be called on each iteration of banner_spec/value pair
    '''
    # check if spec_ is in the allow_parsing list
    if spec_ in ParsingFunctions.allow_parsing:
      # parsing function for a feature MUST be stored as a function of the name, parse_feature_name
      # for example, parse_banner_batsize_hl is the parsing function for the feature banner_batsize_hl
      parsing_function_name = 'parse_' + spec_
      parsing_function = getattr(ParsingFunctions, parsing_function_name)
      parsed_values = parsing_function(value)
      # parsing functions will be written in such a way that the features and values are part of a dict
      for feature_name, feature_value in parsed_values.items():
        setattr(self, feature_name, feature_value)
        self.set_all_features(feature_name)
        self.create_feature(feature_name)
    # if spec_ is not in the allow_parsing list, simply set the current value of the spec
    else:
      setattr(self, spec_, value)
      self.set_all_features(spec_)
      self.create_feature(spec_)
      
  @classmethod   
  def add_new_parsers(cls, new_parsers):
    '''A function for creating user defined parsers, initialize new parsers before calling Device
    
       Accepts a single function or a list of functions   
    '''
    # if a list of parsers is provided, check if each parser is a function that follows our name format
    if isinstance(new_parsers, list):
      for n, parser in enumerate(new_parsers):
        if not callable(parser):
          raise TypeError('parser at index {} in new_parsers must be callable, not of type {}'.format(n, type(parser)))
        else:
          parsing_function_name = parser.__name__
          # slice the name, say parse_batsize_hl at parse_  i.e [6:]
          col_name = parsing_function_name[6:]
          if col_name not in cls.allow_parsing:
            cls.allow_parsing.append(col_name)
            setattr(cls, parsing_function_name, parser)
          else:
            print('WARNING! function for parsing {} already exists! {} was not added'.format(col_name, parsing_function_name))
    elif callable(new_parsers):
      parsing_function_name = new_parsers.__name__
      col_name = parsing_function_name[6:]
      if col_name not in cls.allow_parsing:
        cls.allow_parsing.append(col_name)
        setattr(cls, parsing_function_name, new_parsers)
      else:
        print('WARNING! function for parsing {} already exists! {} was not added'.format(col_name, parsing_function_name))
    else:
      raise TypeError('parser must be callable, not of type {}'.format(type(new_parsers)))
      
  @classmethod
  def clear_existing_parsers(cls):
    '''A function for clearing current parsers'''
    cls.allow_parsing = []
    print('All existing parsing functions have been cleared!')

In [0]:
class FeatureGen(ParsingFunctions):
  '''FeatureGen will contain a dict of all features from all the devices called all_features_dict
  
     It also contains a collection of useful methods for parsing data from a Devices object
  '''
  
  # initialize a collector dictionary to collect features from all devices
  # our dict will have a feature called device_notes for collecting specs where the key is nan
  all_features_dict = {'device_notes':None}
  
  def gen_from_dict(self, spec_value, spec_name):
    '''A function for generating more features from the value of a spec if the spec_value is also a dict'''
    for key, value in spec_value.items():
      if pd.isna(key):
        key = 'nan'
      if pd.isna(value):
        value = 'nan'
      key_ = self.split_string(key)
      # in some cases the key is NaN or '', and in other cases the value is NaN or ''
      # account for these cases
      # both are empty: this is a waste attribute, so don't add it to the feature list of the device
      if (key_ == 'nan' or key_ == '') and (value == '' or value == 'nan'):
        pass
      # if the key is 'nan' or '' but the value is not, create a note about the value under device_notes
      elif (key_ == 'nan' or key_ == '') and (value != '' and value != 'nan'):
        # store the value as a note under the parent spec, e.g. {'battery': 'This device has great battery life'}
        self.device_notes.setdefault(spec_name, value)
      # if the key is not empty and the value is, we do not want this spec
      elif (key_ != 'nan' and key_ != '') and (value == '' or value == 'nan'):
        pass
      # if none of the above issues are there, we can add the feature as an attribute of the device
      else:
        new_key = spec_name + '_' + key_
        self.parse_spec(new_key, value)

In [0]:
class Device(FeatureGen):
  '''A class for working with device data scraped from GSMArena'''

  # initialize the class using a device
  def __init__(self, device, device_id, maker_name, maker_id):
    # keep track of all the features collected for THIS device
    # (an instance attribute; a class-level list would be shared by every Device)
    self.features_list = []

    # start adding attributes to the device object
    self.maker_name = maker_name
    self.create_feature('maker_name')
    self.set_all_features('maker_name')
    
    self.maker_id = maker_id
    self.create_feature('maker_id')
    self.set_all_features('maker_id')
    
    self.device_id = device_id
    self.create_feature('device_id')
    self.set_all_features('device_id')
                
    # set the device_info as attributes of the Device 
    for device_info_name, device_info in device.items():
      # emulates the functionality of self.variable = value
      setattr(self, device_info_name, device_info)
      self.create_feature(device_info_name)
      self.set_all_features(device_info_name)
    
    # all device "specs" except opinion are a dict of sub specs, which we need to parse
    # these will be treated separately using FeatureGen and ParsingFunctions
    self.device_notes = {}
    for spec, value in device['device_specs'].items():
      spec_ = self.split_string(spec)
      if isinstance(value, dict):
        self.gen_from_dict(value, spec_)
        self.create_feature(spec_)
      else:
        setattr(self, spec_, value)
      
      
  def split_string(self, spec_name):
    '''A function for changing the ' ' and '-' delimiters
       in a spec_name to '_'
       
       Given 'Selfie Camera', returns selfie_camera
    '''
    # raw string, so that \s is passed to the regex engine unchanged
    split_spec_pattern = re.compile(r'\s|-|–')
    split_specs = re.split(split_spec_pattern, spec_name)
    return '_'.join(split_specs).lower()
  
  
  def create_feature(self, spec_name):
    '''A function that allows us to consolidate the names of all features recovered from the GIVEN device'''
    if spec_name not in self.features_list:
      self.features_list.append(spec_name)
      
      
  def set_all_features(self, spec_name):
    '''A function that allows us to consolidate the names of all features recovered from ALL devices'''
    if spec_name not in FeatureGen.all_features_dict:
      FeatureGen.all_features_dict.setdefault(spec_name, None)
      
  
  @staticmethod
  def read_devices_json(file_path):
    '''A function for reading in a JSON file.

       Takes in the string file_path and returns a dict
    '''
    with open(file_path, 'r', encoding='utf-8') as file:
      devices_dict = json.load(file)
      return devices_dict
    
    
  @staticmethod
  def list_makers(devices_dict):
    '''A function that takes in the loaded devices_dict data and returns a list of makers
     
       Returns a list of the form  [(0, 'Acer'), .......] for all makers in the devices_dict
    '''
    makers = list(enumerate(devices_dict.keys()))
    return makers
    
    
  @staticmethod
  def create_devices_from_data(devices_dict):
    '''A function for creating objects out of all the devices stored in devices_dict

       Returns a list of Device objects for all devices in the devices_dict 
    '''
    # we want to initialize a list for collecting all the device objects
    devices_collector = []
    # each maker has a maker_id starting from 0 for all the makers
    for maker_id, (maker_name, devices_info) in enumerate(devices_dict.items()):
      maker_name_split = [split.upper() for split in re.split(' |-', maker_name)]
      maker_name_ = ''.join(maker_name_split)

      # iterate through each device under a maker and use maker_name_ and device_num to create a unique device_id
      for device_num, device in devices_info.items():
        device_id = maker_name_ + '_' + device_num
        # create the device object using the Device class and then append the object to the collector array 
        device_ = Device(device, device_id, maker_name, maker_id)
        devices_collector.append(device_)
    return devices_collector
  
        
  @staticmethod     
  def create_feature_column(feature_name, devices_collector):
    '''A function for creating a feature column of a given name using devices_collector'''  
    collector_array = []
    for device in devices_collector:
      feature = getattr(device, feature_name, None)
      if not feature:
        collector_array.append(np.nan)
      else:
        collector_array.append(feature)
    return collector_array

        
  @staticmethod
  def create_df(devices_dict):
    '''A function for creating a DataFrame using all the data from all the Device objects'''
    
    if not devices_dict:
      raise AttributeError('This function cannot be run if devices_dict is empty')
    
    # create Device objects from devices_dict
    devices_collector = Device.create_devices_from_data(devices_dict)

    # get a dict of all features collected across all devices
    all_features_dict = FeatureGen.all_features_dict

    # create the feature columns for each feature and set it as the new value of the all_features_dict
    for feature_name, _ in all_features_dict.items():
      col = Device.create_feature_column(feature_name, devices_collector)
      all_features_dict[feature_name] = col

    # create a DataFrame from the dict and return it
    return pd.DataFrame(all_features_dict)

**Instructions For Using deviceparser.py**

1. Download *deviceparser.py* and move it to [Python's working directory](https://stackoverflow.com/questions/17359698/how-to-get-the-current-working-directory-using-python-3/17361545).
2. Import the module into your IDE.
3. At this point, you can define your own parsing functions, which must comply with the **Rules For Writing Parsing Functions For Devices Data**.
4. Expose the **ParsingFunctions** class to user defined functions by passing a list of functions to the **ParsingFunctions.add_new_parsers()** method.
5. Load the **devices_data.txt** file by passing the file path to **Device.read_devices_json()**. This loads in the JSON file as a Python dictionary.
6. A list of device objects can be created for all the makers by providing the **Device.create_devices_from_data()** method with the dictionary obtained in step 5.
7. To make it easy to manipulate the data once it has been parsed, the **Device.create_df()** method has been defined, which creates a pandas DataFrame from *devices_data*. As an argument, this method takes in the dict object which was loaded in step 5 and applies **Device.create_devices_from_data()** to it. An internal dictionary called **all_features_dict** is used to keep track of all the features collected from all the devices. For example, flagship devices released in 2018 and 2019 tend to have three main cameras, which is denoted by the feature *main_camera_triple*. The method is written such that, for every feature listed in the keys of **all_features_dict**, a device gets its own value if it has that feature and np.nan otherwise.
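
The column-building idea behind step 7 can be sketched without the module or pandas. In the example below, `FakeDevice`, `build_columns`, and the sample values are all hypothetical stand-ins. A missing attribute is filled with NaN, whereas the method in this tutorial also replaces empty or otherwise falsy values with NaN.

```python
import math


class FakeDevice:
    """Stand-in for a parsed Device: any keyword becomes an attribute."""

    def __init__(self, **features):
        for name, value in features.items():
            setattr(self, name, value)


def build_columns(devices, feature_names):
    """Return {feature_name: [value or NaN for each device]}."""
    columns = {}
    for name in feature_names:
        # getattr with a default fills the gap for devices lacking the feature
        columns[name] = [getattr(device, name, math.nan) for device in devices]
    return columns


devices = [
    FakeDevice(device_name='Phone A', main_camera_triple='12 MP + 12 MP + 16 MP'),
    FakeDevice(device_name='Phone B', main_camera_single='8 MP'),
]
columns = build_columns(devices, ['device_name', 'main_camera_triple', 'main_camera_single'])
print(columns)
```

A dict of equal-length lists like `columns` is the shape that `pd.DataFrame` accepts directly, which is why the features are collected this way.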

In [0]:
# assuming that you have deviceparser.py in your working directory, load it as follows
from deviceparser import Device, ParsingFunctions

In [0]:
# define your parsing functions 
class MyParsers:
  '''A container class for housing all the parsing functions'''
  
  # define the parsing functions
  def parse_banner_batsize_hl(spec_value):
    '''A function for parsing the batsize_hl

       Takes in str '3800' and returns the np.float64 3800
    '''
    return {'batsize':np.float64(spec_value)}


  def parse_banner_displayres_hl(spec_value):
    '''A function for parsing the displayres_hl

       Takes in str '1280x1920 pixels' and returns two np.float64's for the length and height of the screen
    '''
    # raw string, so that \d is passed to the regex engine unchanged
    res_pattern = re.compile(r'(\d+)x(\d+)')
    return_val = re.findall(res_pattern, spec_value)
    feature_names = ['displayres_len', 'displayres_height']
    if not return_val:
      return {'displayres_len': np.nan, 'displayres_height': np.nan}
    else:
      return {name: np.float64(i) for name, i in zip(feature_names, return_val[0])}


  def parse_banner_ramsize_hl(spec_value):
    '''A function for parsing the ramsize_hl

       Takes in str '2' and returns the np.float64 2
    '''
    return {'ramsize': np.float64(spec_value)}


  def parse_banner_displaysize_hl(spec_value):
    '''A function for parsing the displaysize_hl

       Takes in str '6.0"' and returns np.float64 6.0
    '''
    disp_size_pattern = re.compile(r'(^\d+\.?\d*)\s?"')
    return_val = re.findall(disp_size_pattern, spec_value)
    if not return_val:
      return {'displaysize': np.nan}
    else:
      return {'displaysize': np.float64(return_val[0])}


  def parse_body_weight(spec_value):
    '''A function for parsing body_weight in devices_data
       Takes in string '165 g ((9.17 oz))' and returns weight in gm as np.float64
    '''
    weight_pattern = re.compile(r'(\d+\.?\d?)\s?(?:g|gm|gram|grms|grams)')
    weight = re.findall(weight_pattern, spec_value)
    if not weight:
      return {'weight': np.nan}
    else:
      feature_value = np.float64(weight[0])    
      return {'weight':feature_value}


  # examples of launch_status
  # 1. Available. Released 2018, July
  # 2. Coming soon. Exp. release 2019, Q1
  # 3. Cancelled
  def parse_launch_status(spec_value):
    '''A function for parsing the launch_status

       Returns str 'Available', 'Coming soon', 'Cancelled' or 'Discontinued'

    '''
    status_pattern = re.compile(r'(Available|Coming\s?soon|Cancelled|Discontinued).*')
    status = re.findall(status_pattern, spec_value)
    if not status:
      return {'launch_status': np.nan}
    else:
      return {'launch_status':status[0]}


  def parse_launch_announced(spec_value):
    '''A function for parsing the release date of a phone

       Returns a pandas.Timestamp object
    '''
    # adjacent raw strings are concatenated, so no stray whitespace ends up inside the pattern
    # (a backslash line continuation inside the quotes would embed the next line's indentation)
    year_month_pattern = re.compile(r'\d{4}[\s,.]*(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|'
                                    r'Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|'
                                    r'Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)')
    year_only_pattern = re.compile(r'\d{4}')

    # finding year, month gets highest priority
    year_month = re.findall(year_month_pattern, spec_value)

    if not year_month:
      year = re.findall(year_only_pattern, spec_value)
      # if year is also not present then set both month and year to np.nan
      if not year:
        dt = np.nan
      else:
        dt = year[0]
    else:
      dt = year_month[0]

    return {'launch_announced':pd.to_datetime(dt)}


  def parse_body_dimensions(spec_value):
    '''A function for parsing the body dimensions

       Returns the value of each dimension as a np.float64. Returns np.NaN if dimensions not found
    '''
    dimensions_pattern = re.compile(r'((?:\d+\.?\d*[\sx]+){3})mm')
    float_pattern = re.compile(r'\d+\.?\d*')

    dimensions = re.findall(dimensions_pattern, spec_value)
    if not dimensions:
      return {'body_x': np.nan, 'body_y': np.nan, 'body_z': np.nan}
    else:
      # find each of the dimensions
      each_dim = re.findall(float_pattern, dimensions[0])
      x = np.float64(each_dim[0])
      y = np.float64(each_dim[1])
      z = np.float64(each_dim[2])
      return {'body_x':x, 'body_y':y, 'body_z':z}


  # different types of displays noticed
  # 1. OLED
  # 2. LCD 
  # 3. TFT - variant of LCD
  # 4. IPS - variant of LCD
  # 5. Monochrome|Grayscale
  def parse_display_type(spec_value):
    '''A function for categorizing the display types of devices

       Returns str category 'OLED', 'LCD', 'TFT', 'IPS' or 'MONO'
    '''
    # if this is not a known display type return NaN 
    display_type_category = np.nan

    if re.compile('.*OLED.*').match(spec_value):
      display_type_category = 'OLED'
    elif re.compile('.*LCD.*').match(spec_value):
      display_type_category = 'LCD'
    elif re.compile('.*TFT.*').match(spec_value):
      display_type_category = 'TFT'
    elif re.compile('.*IPS.*').match(spec_value):
      display_type_category = 'IPS'
    elif re.compile('.*[Mm]onochrome.*|.*[Gg]rayscale.*').match(spec_value):
      display_type_category = 'MONO'
    return {'display_type':spec_value, 'display_type_category':display_type_category}


  # values of battery_talk_time
  # 1. Up to 4 h 30 min
  # 2. Up to 4 h
  # 3. Up to 7 h 30 min (multimedia)
  # 4. Up to 15 h 20 min (2G) / Up to 6 h (3G)
  def parse_battery_talk_time(spec_value):
    '''A function for parsing a device's talk time

       Returns three pd.Timedelta objects
       If 3G or 2G talk times are reported, they get a value, else np.nan
       If neither 3G nor 2G talk times are reported, an attempt is made to find a generic talk time
    '''
    # output variables
    talk_time_3g_td = np.nan
    talk_time_2g_td = np.nan
    talk_time_td = np.nan

    talk_time_3g_pattern = re.compile(r'.*Up\sto\s((?:\d+\sh)?(?:\s?\d+\smin)?)\s\(3G\)')
    talk_time_2g_pattern = re.compile(r'.*Up\sto\s((?:\d+\sh)?(?:\s?\d+\smin)?)\s\(2G\)')
    talk_time_pattern = re.compile(r'.*Up\sto\s((?:\d+\sh)?(?:\s?\d+\smin)?)')

    talk_time_3g = re.findall(talk_time_3g_pattern, spec_value) 
    talk_time_2g = re.findall(talk_time_2g_pattern, spec_value) 

    if talk_time_3g:
      talk_time_3g_td = pd.to_timedelta(talk_time_3g[0])

    if talk_time_2g:
      talk_time_2g_td = pd.to_timedelta(talk_time_2g[0])

    # if this device has neither 2g or 3g talk time, try finding a generic talktime
    if (not talk_time_3g) and (not talk_time_2g):
      talk_time = re.findall(talk_time_pattern, spec_value)
      if talk_time:
        talk_time_td = pd.to_timedelta(talk_time[0])

    return {'talk_time':talk_time_td, 'talk_time_2g':talk_time_2g_td, 'talk_time_3g':talk_time_3g_td}


  def parse_platform_chipset(spec_value):
    '''A function for parsing the device chipset

       Returns silicon gate width (x nm) as np.float64 along with chipset name
    '''
    gate_width_pattern = re.compile(r'\((\d+)\snm\)')
    gate_width = re.findall(gate_width_pattern, spec_value)

    if gate_width:
      return {'platform_chipset': spec_value, 'platform_chipset_gate_width': np.float64(gate_width[0])}
    else:
      return {'platform_chipset': spec_value, 'platform_chipset_gate_width': np.nan}
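
The talk-time strings listed in the comments above can also be reduced to plain minutes with the standard library alone. The sketch below is an alternative illustration, not the parser used in this tutorial. The function name `talk_time_minutes` and the `'generic'` key are hypothetical.

```python
import re

# Hours and minutes are each optional; a trailing (2G) or (3G) tag is captured if present.
DURATION = re.compile(r'Up\sto\s(?:(\d+)\sh)?(?:\s?(\d+)\smin)?(?:\s\((2G|3G)\))?')


def talk_time_minutes(value):
    """Return {'generic' | '2G' | '3G': minutes} for each 'Up to ...' entry in value."""
    result = {}
    for hours, minutes, network in DURATION.findall(value):
        if not hours and not minutes:
            continue  # 'Up to' without a recognisable duration
        total = int(hours or 0) * 60 + int(minutes or 0)
        result[network or 'generic'] = total
    return result


print(talk_time_minutes('Up to 4 h 30 min'))
print(talk_time_minutes('Up to 15 h 20 min (2G) / Up to 6 h (3G)'))
```

Integer minutes are easy to compare and aggregate, and they can be converted to a timedelta later if needed.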

In [0]:
# current version of ParsingFunctions does not come preloaded with any parsing functions
print(ParsingFunctions.__dict__)

# ParsingFunctions can be provided a single parsing functions or a list of parsing functions
ParsingFunctions.add_new_parsers(MyParsers.parse_banner_batsize_hl)
print(ParsingFunctions.__dict__)

ParsingFunctions.add_new_parsers([MyParsers.parse_banner_displayres_hl,
                                  MyParsers.parse_banner_ramsize_hl,
                                  MyParsers.parse_banner_displaysize_hl
                                 ])

print(ParsingFunctions.__dict__)

# only one parsing function of a given name can exist at a time inside ParsingFunctions
# a warning message is printed out if you try to reassign a function which already exists
# use the clear_existing_parsers() method if you need to reimport a parser of the same name
ParsingFunctions.clear_existing_parsers()

# defining user defined functions inside a class makes it easy to obtain the functions as a list
parsing_functions_list = [value for key, value in MyParsers.__dict__.items() if key.startswith('parse_')]

ParsingFunctions.add_new_parsers(parsing_functions_list)


{'__module__': '__main__', '__doc__': 'A class for housing all the parsing function which will be used on device specs data', 'allow_parsing': [], 'parse_spec': <function ParsingFunctions.parse_spec at 0x7dc5ca279510>, 'add_new_parsers': <classmethod object at 0x7dc5b41080b8>, 'clear_existing_parsers': <classmethod object at 0x7dc5b4108240>, '__dict__': <attribute '__dict__' of 'ParsingFunctions' objects>, '__weakref__': <attribute '__weakref__' of 'ParsingFunctions' objects>, 'parse_banner_batsize_hl': <function MyParsers.parse_banner_batsize_hl at 0x7dc5b4895048>, 'parse_banner_displayres_hl': <function MyParsers.parse_banner_displayres_hl at 0x7dc5b4895488>, 'parse_banner_ramsize_hl': <function MyParsers.parse_banner_ramsize_hl at 0x7dc5b493e598>, 'parse_banner_displaysize_hl': <function MyParsers.parse_banner_displaysize_hl at 0x7dc5b493e6a8>, 'parse_body_weight': <function MyParsers.parse_body_weight at 0x7dc5b493e400>, 'parse_launch_status': <function MyParsers.parse_launch_statu
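
Collecting parsers from a container class, as done in the cell above, can be sketched on a small scale. `DemoParsers` and its functions are hypothetical. The extra `callable` check is an added precaution so that a non-function attribute with a matching name is not picked up.

```python
class DemoParsers:
    """Hypothetical container; the functions are used unbound, so they take no self."""

    def parse_body_weight(spec_value):
        return {'weight': spec_value}

    def parse_launch_status(spec_value):
        return {'launch_status': spec_value}

    def helper(spec_value):
        # Not picked up: the name does not start with 'parse_'.
        return spec_value


# vars() returns the class namespace, which keeps definition order.
parsers = [
    func for name, func in vars(DemoParsers).items()
    if name.startswith('parse_') and callable(func)
]
print([func.__name__ for func in parsers])
```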

In [0]:
# loading devices_data
file_path = '../devices_data.txt'
devices_dict = Device.read_devices_json(file_path)
devices_dict.keys()

dict_keys(['Acer', 'alcatel', 'Allview', 'Amazon', 'Amoi', 'Apple', 'Archos', 'Asus', 'AT&T', 'Benefon', 'BenQ', 'BenQ-Siemens', 'Bird', 'BlackBerry', 'Blackview', 'BLU', 'Bosch', 'BQ', 'Casio', 'Cat', 'Celkon', 'Chea', 'Coolpad', 'Dell', 'Emporia', 'Energizer', 'Ericsson', 'Eten', 'Fujitsu Siemens', 'Garmin-Asus', 'Gigabyte', 'Gionee', 'Google', 'Haier', 'Honor', 'HP', 'HTC', 'Huawei', 'i-mate', 'i-mobile', 'Icemobile', 'Infinix', 'Innostream', 'iNQ', 'Intex', 'Jolla', 'Karbonn', 'Kyocera', 'Lava', 'LeEco', 'Lenovo', 'LG', 'Maxon', 'Maxwest', 'Meizu', 'Micromax', 'Microsoft', 'Mitac', 'Mitsubishi', 'Modu', 'Motorola', 'MWg', 'NEC', 'Neonode', 'NIU', 'Nokia', 'Nvidia', 'O2', 'OnePlus', 'Oppo', 'Orange', 'Palm', 'Panasonic', 'Pantech', 'Parla', 'Philips', 'Plum', 'Posh', 'Prestigio', 'QMobile', 'Qtek', 'Razer', 'Realme', 'Sagem', 'Samsung', 'Sendo', 'Sewon', 'Sharp', 'Siemens', 'Sonim', 'Sony', 'Sony Ericsson', 'Spice', 'T-Mobile', 'TECNO', 'Tel.Me.', 'Telit', 'Thuraya', 'Toshiba', 'Un

In [0]:
# access maker_id and maker_name using Device.list_makers() method
makers = Device.list_makers(devices_dict)
print(makers)

[(0, 'Acer'), (1, 'alcatel'), (2, 'Allview'), (3, 'Amazon'), (4, 'Amoi'), (5, 'Apple'), (6, 'Archos'), (7, 'Asus'), (8, 'AT&T'), (9, 'Benefon'), (10, 'BenQ'), (11, 'BenQ-Siemens'), (12, 'Bird'), (13, 'BlackBerry'), (14, 'Blackview'), (15, 'BLU'), (16, 'Bosch'), (17, 'BQ'), (18, 'Casio'), (19, 'Cat'), (20, 'Celkon'), (21, 'Chea'), (22, 'Coolpad'), (23, 'Dell'), (24, 'Emporia'), (25, 'Energizer'), (26, 'Ericsson'), (27, 'Eten'), (28, 'Fujitsu Siemens'), (29, 'Garmin-Asus'), (30, 'Gigabyte'), (31, 'Gionee'), (32, 'Google'), (33, 'Haier'), (34, 'Honor'), (35, 'HP'), (36, 'HTC'), (37, 'Huawei'), (38, 'i-mate'), (39, 'i-mobile'), (40, 'Icemobile'), (41, 'Infinix'), (42, 'Innostream'), (43, 'iNQ'), (44, 'Intex'), (45, 'Jolla'), (46, 'Karbonn'), (47, 'Kyocera'), (48, 'Lava'), (49, 'LeEco'), (50, 'Lenovo'), (51, 'LG'), (52, 'Maxon'), (53, 'Maxwest'), (54, 'Meizu'), (55, 'Micromax'), (56, 'Microsoft'), (57, 'Mitac'), (58, 'Mitsubishi'), (59, 'Modu'), (60, 'Motorola'), (61, 'MWg'), (62, 'NEC'), 
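
The output above pairs every maker name with an integer id. One way such a list could be produced is with `enumerate()`, assuming the ids simply follow the order of the dictionary keys. That is an assumption made for this sketch, and the actual `Device.list_makers()` may work differently; `toy_devices` and `list_makers_sketch` are made-up names.

```python
# toy stand-in for devices_dict: the keys are maker names (hypothetical data)
toy_devices = {"Acer": {}, "alcatel": {}, "Allview": {}}


def list_makers_sketch(devices):
    """Return (maker_id, maker_name) tuples in dictionary key order."""
    # iterating over a dict yields its keys in insertion order
    return list(enumerate(devices))


toy_makers = list_makers_sketch(toy_devices)
print(toy_makers)
```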

In [0]:
# parsing a single device object

maker_name = 'Samsung'
# get Samsung's maker_id from the list we just created
for maker in makers:
  if maker[1] == maker_name:
    maker_id = maker[0]
    
# get the data for Galaxy S10 from the devices_dict 
device_name = 'Galaxy S10'
for device_num, device in devices_dict[maker_name].items():
  if device['device_name'] == device_name:
    device_id = maker_name.upper() + '_' + device_num 
    break
    
# create the device object by initializing an instance of the Device class
s10 = Device(device, device_id, maker_name, maker_id)

# new features have been created using the parsing functions
print(s10.batsize)
print(s10.displayres_height)
print(s10.displayres_len)
print(s10.displaysize)
print(s10.ramsize)

3400.0
3040.0
1440.0
6.1
8.0
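
The two lookup loops in the cell above can also be written with a dictionary and `next()`, which additionally makes the "not found" case explicit. The sketch below uses small invented data shaped like `makers` and `devices_dict[maker_name]`; the toy names and values are assumptions for illustration only.

```python
# invented data with the same shape as makers and devices_dict['Samsung']
toy_makers = [(0, "Acer"), (1, "Samsung")]
toy_samsung = {
    "0": {"device_name": "Galaxy Fold"},
    "1": {"device_name": "Galaxy S10"},
}

maker_name = "Samsung"
device_name = "Galaxy S10"

# invert the (maker_id, maker_name) tuples into a name -> id mapping
maker_ids = {name: idx for idx, name in toy_makers}
maker_id = maker_ids[maker_name]

# next() returns the first match, or the default (None, None) if nothing matches
device_num, device = next(
    ((num, dev) for num, dev in toy_samsung.items() if dev["device_name"] == device_name),
    (None, None),
)
if device is None:
    raise ValueError(f"{device_name} not found for {maker_name}")

device_id = maker_name.upper() + "_" + device_num
print(maker_id, device_id)
```

With the plain `for` loops, a misspelled device name would silently leave `device` pointing at the last device in the dictionary; raising an error avoids building the wrong object.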


In [0]:
# creating a list of device objects for all makers
devices_collector = Device.create_devices_from_data(devices_dict)
print(len(devices_collector))
print(type(devices_collector[0]))

9536
<class '__main__.Device'>


In [0]:
# simple one-line filters can be written for devices_collector using list comprehensions
# use getattr(x, 'feature_name', None) to read a feature instead of x.feature_name
# this ensures that an AttributeError is not raised if the device does not have the attribute
samsung_devices = [x for x in devices_collector if getattr(x, 'maker_name', None) == 'Samsung']
print(len(samsung_devices) == len(devices_dict['Samsung'].keys()))

s10 = [x for x in devices_collector if getattr(x, 'device_name', None) == 'Galaxy S10']
print(s10[0].device_name)

# np.nan is used because the np.NaN alias was removed in NumPy 2.0
batsize = [x.batsize if getattr(x, 'batsize', None) is not None else np.nan for x in devices_collector]
print(batsize)

# this might seem complicated if you are a beginner, but it is a great opportunity to learn about list comprehensions

True
Galaxy S10
[4500.0, 3400.0, 4080.0, 2000.0, 6100.0, 4020.0, 3000.0, 5000.0, 2000.0, 4420.0, 2870.0, 2000.0, 2000.0, 4000.0, 4000.0, 2420.0, 2420.0, 2000.0, 2000.0, 5910.0, 4550.0, 5910.0, 5910.0, 2300.0, 2000.0, 1300.0, 1300.0, 2000.0, 2300.0, 2000.0, 2700.0, 2100.0, 3500.0, 2500.0, 1300.0, 4600.0, 3400.0, 3400.0, 3700.0, 2000.0, 2500.0, 1630.0, 2955.0, 2955.0, 4000.0, 2000.0, 3300.0, 1500.0, 2400.0, 7300.0, 4960.0, 4960.0, 2000.0, 1300.0, 2000.0, 1760.0, 2640.0, 2710.0, 3420.0, 1300.0, 1500.0, 1500.0, 1300.0, 1460.0, 3260.0, 3260.0, 9800.0, 9800.0, 9800.0, 9800.0, 1300.0, nan, 3260.0, 1500.0, 3260.0, 1530.0, 1530.0, 1300.0, 1300.0, 1300.0, 1500.0, 1500.0, 1500.0, 1500.0, 1400.0, 1350.0, 1090.0, 1090.0, 970.0, 1500.0, 1350.0, 1350.0, 1140.0, 1140.0, 1140.0, 1260.0, 1530.0, 1530.0, 1530.0, 1530.0, 4080.0, 3500.0, 3500.0, 3060.0, 3000.0, 2460.0, 2050.0, 4000.0, 4000.0, 3000.0, 3000.0, 3000.0, 3000.0, 2460.0, 3000.0, 4000.0, 2580.0, 2850.0, 2800.0, 4000.0, 4000.0, 2050.0, 2800.0, 262
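
If the `getattr()` pattern in the cell above is new to you, the following small sketch shows what it does for objects that lack an attribute. `ToyDevice` is a hypothetical class made up for this example (it is not the tutorial's `Device` class), and `math.nan` is used instead of NumPy so the snippet has no dependencies.

```python
import math


class ToyDevice:
    """Hypothetical minimal object: attributes are set only when provided."""

    def __init__(self, **features):
        for key, value in features.items():
            setattr(self, key, value)


toy_collector = [
    ToyDevice(device_name="A", batsize=3400.0),
    ToyDevice(device_name="B"),                # no batsize attribute at all
    ToyDevice(device_name="C", batsize=None),  # attribute exists but is empty
]

# getattr() with a default avoids an AttributeError for device B
raw = [getattr(x, "batsize", None) for x in toy_collector]

# map both "missing attribute" and "None" to NaN
batsizes = [value if value is not None else math.nan for value in raw]
print(batsizes)
```

Accessing `toy_collector[1].batsize` directly would raise an `AttributeError`, which is exactly what the default argument of `getattr()` prevents.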

In [0]:
# create a DataFrame after applying all the user defined parsing functions to the devices data
df = Device.create_df(devices_dict)
df.head()

Unnamed: 0,device_notes,maker_name,maker_id,device_id,device_name,device_info,device_img_link,device_link,device_specs,network_technology,...,features_messaging,features_games,features_java,sound_alert_types,features_clock,features_alarm,features_languages,main_camera_quad,selfie_camera_triple,main_camera_five
0,"{'body': '- Stylus', 'battery': 'Non-removable...",Acer,,ACER_0,Chromebook Tab 10,Acer Chromebook Tab 10 tablet. Announced Mar 2...,https://cdn2.gsmarena.com/vv/bigpic/acer-chrom...,https://www.gsmarena.com/acer_chromebook_tab_1...,{'Network': {'Technology': 'No cellular connec...,No cellular connectivity,...,,,,,,,,,,
1,"{'network': 'LTE band 3(1800), 7(2600), 8(900)...",Acer,,ACER_1,Iconia Talk S,Acer Iconia Talk S Android tablet. Announced A...,https://cdn2.gsmarena.com/vv/bigpic/acer-iconi...,https://www.gsmarena.com/acer_iconia_talk_s-83...,"{'Network': {'Technology': 'GSM / HSPA / LTE',...",GSM / HSPA / LTE,...,,,,,,,,,,
2,{'battery': 'Non-removable Li-Po 4080 mAh batt...,Acer,,ACER_2,Liquid Z6 Plus,Acer Liquid Z6 Plus Android smartphone. Announ...,https://cdn2.gsmarena.com/vv/bigpic/acer-liqui...,https://www.gsmarena.com/acer_liquid_z6_plus-8...,"{'Network': {'Technology': 'GSM / HSPA / LTE',...",GSM / HSPA / LTE,...,,,,,,,,,,
3,{'battery': 'Removable Li-Ion 2000 mAh battery'},Acer,,ACER_3,Liquid Z6,Acer Liquid Z6 Android smartphone. Announced A...,https://cdn2.gsmarena.com/vv/bigpic/acer-liqui...,https://www.gsmarena.com/acer_liquid_z6-8304.php,"{'Network': {'Technology': 'GSM / HSPA / LTE',...",GSM / HSPA / LTE,...,,,,,,,,,,
4,"{'sound': '- DTS HD sound', 'features': '- HDM...",Acer,,ACER_4,Iconia Tab 10 A3-A40,Acer Iconia Tab 10 A3-A40 Android tablet. Anno...,https://cdn2.gsmarena.com/vv/bigpic/acer-iconi...,https://www.gsmarena.com/acer_iconia_tab_10_a3...,{'Network': {'Technology': 'No cellular connec...,No cellular connectivity,...,,,,,,,,,,
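
Many of the trailing columns in the preview above are empty, presumably because each device only carries the features that were found for it. The sketch below shows one way objects with differing attributes could be flattened into uniform rows. It is not the tutorial's `Device.create_df()` implementation; `ToyDevice` and the sample values are hypothetical. The snippet uses only the standard library, and passing the same list of dictionaries to `pd.DataFrame()` would perform this alignment for you and fill the gaps with `NaN`.

```python
class ToyDevice:
    """Hypothetical minimal object: attributes are set only when provided."""

    def __init__(self, **features):
        for key, value in features.items():
            setattr(self, key, value)


toy_collector = [
    ToyDevice(device_name="A", features_java="Yes"),
    ToyDevice(device_name="B", main_camera_quad="48 MP"),
]

# vars() returns the instance attributes of an object as a dictionary
rows = [vars(device) for device in toy_collector]

# build the union of all feature names, keeping first-seen order
columns = []
for row in rows:
    for key in row:
        if key not in columns:
            columns.append(key)

# dict.get() returns None for features a device does not have
table = [[row.get(col) for col in columns] for row in rows]
print(columns)
print(table)
```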


In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9536 entries, 0 to 9535
Data columns (total 100 columns):
device_notes                   9535 non-null object
maker_name                     9536 non-null object
maker_id                       9436 non-null float64
device_id                      9536 non-null object
device_name                    9536 non-null object
device_info                    9536 non-null object
device_img_link                9536 non-null object
device_link                    9536 non-null object
device_specs                   9536 non-null object
network_technology             9536 non-null object
network_2g_bands               9521 non-null object
network_3g_bands               5728 non-null object
network_4g_bands               2392 non-null object
network_speed                  5755 non-null object
network_gprs                   9504 non-null object
network_edge                   9513 non-null object
launch_announced               9459 non-null datetime64[ns]

In [0]:
df.to_csv('devices_data_full.csv')