#  WebScraping Latest Mobile-Devices from GSMArena

## Using Python, Requests and Beautifulsoup4

![](https://i.imgur.com/9Rv18FX.jpg)

# Introduction : 

```GSMArena.com``` is a free online database of mobile devices. It publishes specifications, reviews, and news for phones, tablets, and other devices, and offers tools such as a device finder and side-by-side comparison.

Besides the latest releases, the website lists handsets from a very wide range of brands, which makes it something like a Wikipedia of mobile handsets.

In this project we scrape every brand listed in the site's mobile device category and collect the details of the most recently launched devices for each brand.

This project is part of the Data Science and Machine Learning Bootcamp course hosted on Jovian.

Scraping a website involves fetching a page and extracting data from it. When a user views a page, the browser downloads it; a scraper downloads the same page from code.

A number of tools and libraries are available for scraping.

Here we scrape the GSMArena website using the Python libraries Requests and BeautifulSoup4. We build a set of functions that extract a handful of features for our dataset, and we write the dataset in CSV format.



# Outline of the Project:

- Download the GSMArena web page.

- Parse the HTML code of the web page.

- Extract the data from the parsed HTML page into lists.

- Combine the lists into a dictionary.

- Write the data to CSV using a Pandas DataFrame.

# Project Steps:

The project is divided into two parts:

- Part One - Getting the list of brands
- Part Two - Scraping the latest launched devices for each brand



1) Getting the list of brands

- Download the GSMArena makers (brands) web page using the requests library.

- Parse the web page using BeautifulSoup4.

- The web page contains information on 119 brands.

- For each brand we extract the following fields, each collected in its own list:

```Brand_Name```, ```No.of.devices.launched```, ```Brand's Url```

- The lists are combined into a Python dict.

- The dict is passed to Pandas to build a DataFrame, which is then saved as a CSV file.


2) Getting the latest launched devices for each brand

- Download each brand's web page with the requests library, using the URLs from the brand URL list.

- Parse the brand web page using the Beautiful Soup library.

- Write functions to extract information from the parsed HTML page.

- The data extracted for each brand includes:

```device_name```, ```announced date```, ```features```, ```device_url```

- The lists are combined into a Python dict.

- The dict is passed to Pandas to build a DataFrame, which is then saved as a CSV file.

## Installation Of Required Python Libraries for Project

In [1]:
!pip install beautifulsoup4 --quiet
!pip install requests --quiet

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os


- We will use the requests, BeautifulSoup4, pandas, and os libraries for web scraping.

## Extracting the Page

- We complete the code with the help of reusable functions.
- Download the GSMArena web page.

- Parse the HTML code of the web page.

In [1]:
def get_page():

    # URL of the GSMArena brands (makers) page
    mobiles_url = 'https://www.gsmarena.com/makers.php3'

    # Download the page with the requests library, sending a custom User-Agent header
    response = requests.get(mobiles_url, headers={'User-agent': 'Super Bot Power Level Over 9000'})

    # Check whether the page loaded successfully
    if response.status_code != 200:
        raise Exception('Failed to load data {}'.format(mobiles_url))

    # Parse the response text with BeautifulSoup
    doc = BeautifulSoup(response.text, 'html.parser')

    # Return the parsed page
    return doc
    

In [6]:
doc = get_page() #calling the get_page function

In [7]:
# It is a Beautiful Soup Object
type(doc)

bs4.BeautifulSoup

## First Part Of The Problem - Getting All Brands Dictionary

- Part One of the project requires two functions:

### 1. scrape_brands()

- We will define a function called scrape_brands() that downloads the brands page and picks out the tags needed for each brand.

### 2. get_brands_dict()

- We will use a helper function called get_brands_dict() to build a dictionary of brand attributes: brand name, device count, and brand URL.



- We build the functions step by step and assemble the final versions at the end of each part. The steps for Part One are:

1) Getting the brand name
2) Getting the number of devices launched
3) Getting the brand's URL


In [5]:
mobiles_url = 'https://www.gsmarena.com/makers.php3' # Loading URL

In [6]:
response = requests.get(mobiles_url) 

In [7]:
response.status_code # if it gives a status code which is 200 - > Success

200

In [12]:
page_contents = response.text  # Page Downloaded Successfully

In [13]:
page_contents[:1000]  #First 1000 characters

'<!doctype html>\r\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">\r\n<head>\r\n<title>List of all mobile phone brands - GSMArena.com</title>\r\n<script>\r\nDESKTOP_BASE_URL = "https://www.gsmarena.com/";\r\nMOBILE_BASE_URL = "https://m.gsmarena.com/";\r\nASSETS_BASE_URL  = "https://fdn.gsmarena.com/vv/assets12/";\r\nCDN_BASE_URL = "//fdn.gsmarena.com/";\r\nCDN2_BASE_URL = "//fdn2.gsmarena.com/";\r\n</script>\r\n\r\n\r\n<meta charset="utf-8">\r\n<meta name="viewport" content="width=1060, initial-scale=1.0">\r\n<link rel="stylesheet" href="https://fdn.gsmarena.com/vv/assets12/css/gsmarena.css?v=64">\r\n\r\n<link rel="shortcut icon" href="https://fdn.gsmarena.com/imgroot/static/favicon.ico">\r\n\r\n\r\n<script>\r\nwindow["pgGlobalSettings"] = {\r\n    "global": {\r\n        "strategy": "include"\r\n    },\r\n    "adUnits": []\r\n};\r\n\r\nfunction addAdUnit(adUnitCode) {\r\n    window.pgGlobalSettings.adUnits.push({ "adUnitCode": adUnitCode });\r\n};\r\n\r\n\t\

In [14]:
with open('mobilepage.html', 'w', encoding='utf-8') as f:  # explicit encoding avoids platform-dependent write errors
    f.write(page_contents)

Screenshot of the file explorer: the web page saved as mobilepage.html

![](https://i.imgur.com/7XFnu8Z.jpg)

GSM Arena Brand Page Downloaded

![](https://i.imgur.com/bemeAZE.jpg)

In [16]:
doc = BeautifulSoup(page_contents,'html.parser') # beautifulsoup takes page contents and parses as html

![](https://i.imgur.com/2PTfaeM.png)

- After inspecting the elements, we find that each ```td``` tag holds the data we need for one brand.

In [17]:
mobile_brand_tags = doc.find_all('td')  

In [18]:
len(mobile_brand_tags)

119

There are 119 ```td``` tags, so the page lists 119 mobile brands.

Let's get the first brand on the page.

In [19]:
first_device=mobile_brand_tags[0]
first_device

<td><a href="acer-phones-59.php">Acer<br/><span>100 devices</span></a></td>

Let us get the brand name from the tag. The name is the text node that sits just before the ```br``` tag.

In [20]:
first_device.find('br').previous_sibling  # previous_sibling is the current spelling of previousSibling

'Acer'

So the first brand is Acer.

Now let us get the names of all the brands on the page and collect them in a list.

In [21]:
Brands = []
for tag in mobile_brand_tags:
    Brands.append(tag.find('br').previous_sibling)

In [23]:
print(Brands[:20]) # printing the first 20 brands

['Acer', 'alcatel', 'Allview', 'Amazon', 'Amoi', 'Apple', 'Archos', 'Asus', 'AT&T', 'Benefon', 'BenQ', 'BenQ-Siemens', 'Bird', 'BlackBerry', 'Blackview', 'BLU', 'Bosch', 'BQ', 'Casio', 'Cat']


We now have all the brand names in the ```Brands``` list.

In the same fashion, let's get the URL for the first brand.

In [24]:
first_device.find('a')['href']

'acer-phones-59.php'

The ```href``` is a relative path, so the complete URL is the site's base URL followed by this path: ```complete_url = base_url + relative_url```.
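As a small illustration of building the complete URL, the sketch below uses ```urljoin``` from the standard library instead of plain string concatenation. This is an alternative to the concatenation used in the notebook, not the author's method, and ```build_full_url``` is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.gsmarena.com/"  # same base URL the notebook uses


def build_full_url(relative_href, base_url=BASE_URL):
    """Resolve a relative href from the brands page against the site root."""
    return urljoin(base_url, relative_href)


full_url = build_full_url("acer-phones-59.php")
print(full_url)
```

One reason to prefer ```urljoin``` is that it also copes with an ```href``` that starts with a slash, where naive concatenation would produce a double slash.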

Now let's get the number of devices for ```first_device```.

In [25]:
first_device.find('span').text.strip()

'100 devices'

It looks like the first brand, Acer, has 100 devices listed so far.

Let's get the number of devices launched by every brand on the website.

In [26]:
Devices = []
for tag in mobile_brand_tags:
    Devices.append(tag.find('span').text.strip())

In [27]:
print(Devices[:20])

['100 devices', '407 devices', '157 devices', '22 devices', '47 devices', '108 devices', '43 devices', '196 devices', '4 devices', '9 devices', '35 devices', '27 devices', '61 devices', '92 devices', '45 devices', '360 devices', '10 devices', '20 devices', '5 devices', '21 devices']


We now have the number of devices launched by every brand.
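The counts are stored as text such as ```'100 devices'```. If numeric values are needed later (for sorting or summing, say), a minimal sketch of the conversion could look like this. ```parse_device_count``` is a hypothetical helper, and it assumes every label starts with the number.

```python
def parse_device_count(text):
    """Turn a label such as '100 devices' into the integer 100."""
    first_token = text.strip().split()[0]  # the part before the first space
    return int(first_token)


labels = ['100 devices', '407 devices', '4 devices']
counts = [parse_device_count(label) for label in labels]
print(counts)
```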

Each brand's ```href``` is a relative path; prefixing it with the base URL gives the complete URL.

Let's build the complete URL for every brand and collect the results in a list.

In [28]:
base_url = "https://www.gsmarena.com/"
Urls = []
for tag in mobile_brand_tags:
    Urls.append(base_url + tag.find('a')['href'])

The list now holds the complete URL of every brand.

In [30]:
Urls[:20]

['https://www.gsmarena.com/acer-phones-59.php',
 'https://www.gsmarena.com/alcatel-phones-5.php',
 'https://www.gsmarena.com/allview-phones-88.php',
 'https://www.gsmarena.com/amazon-phones-76.php',
 'https://www.gsmarena.com/amoi-phones-28.php',
 'https://www.gsmarena.com/apple-phones-48.php',
 'https://www.gsmarena.com/archos-phones-90.php',
 'https://www.gsmarena.com/asus-phones-46.php',
 'https://www.gsmarena.com/at&t-phones-57.php',
 'https://www.gsmarena.com/benefon-phones-15.php',
 'https://www.gsmarena.com/benq-phones-31.php',
 'https://www.gsmarena.com/benq_siemens-phones-42.php',
 'https://www.gsmarena.com/bird-phones-34.php',
 'https://www.gsmarena.com/blackberry-phones-36.php',
 'https://www.gsmarena.com/blackview-phones-116.php',
 'https://www.gsmarena.com/blu-phones-67.php',
 'https://www.gsmarena.com/bosch-phones-10.php',
 'https://www.gsmarena.com/bq-phones-108.php',
 'https://www.gsmarena.com/casio-phones-77.php',
 'https://www.gsmarena.com/cat-phones-89.php']
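
The three lookups used so far (brand name, device count, and relative URL) can be bundled into one small helper. The sketch below is illustrative only: ```parse_brand_cell``` is a hypothetical name, the markup is the first ```td``` tag shown earlier, and it uses ```previous_sibling```, the current BeautifulSoup spelling of ```previousSibling```.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the first <td> tag on the brands page.
html = '<td><a href="acer-phones-59.php">Acer<br/><span>100 devices</span></a></td>'
td = BeautifulSoup(html, "html.parser").find("td")


def parse_brand_cell(td_tag):
    """Collect the name, device-count label, and relative link of one brand cell."""
    link = td_tag.find("a")
    return {
        "name": str(link.find("br").previous_sibling).strip(),
        "devices": link.find("span").get_text(strip=True),
        "href": link["href"],
    }


brand = parse_brand_cell(td)
print(brand)
```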

Let's create a CSV file with columns for brand name, number of devices, and brand URL from a Pandas DataFrame.

First we create a dictionary from the ```Brands```, ```Devices```, and ```Urls``` lists built above.

In [31]:
mobiles_dict = {
    
    'Brands' : Brands,
    'Number_Of_Devices':Devices,
    'Urls':Urls
    
}

In [32]:
mobiles_dict

{'Brands': ['Acer',
  'alcatel',
  'Allview',
  'Amazon',
  'Amoi',
  'Apple',
  'Archos',
  'Asus',
  'AT&T',
  'Benefon',
  'BenQ',
  'BenQ-Siemens',
  'Bird',
  'BlackBerry',
  'Blackview',
  'BLU',
  'Bosch',
  'BQ',
  'Casio',
  'Cat',
  'Celkon',
  'Chea',
  'Coolpad',
  'Dell',
  'Doogee',
  'Emporia',
  'Energizer',
  'Ericsson',
  'Eten',
  'Fairphone',
  'Fujitsu Siemens',
  'Garmin-Asus',
  'Gigabyte',
  'Gionee',
  'Google',
  'Haier',
  'Honor',
  'HP',
  'HTC',
  'Huawei',
  'i-mate',
  'i-mobile',
  'Icemobile',
  'Infinix',
  'Innostream',
  'iNQ',
  'Intex',
  'Jolla',
  'Karbonn',
  'Kyocera',
  'Lava',
  'LeEco',
  'Lenovo',
  'LG',
  'Maxon',
  'Maxwest',
  'Meizu',
  'Micromax',
  'Microsoft',
  'Mitac',
  'Mitsubishi',
  'Modu',
  'Motorola',
  'MWg',
  'NEC',
  'Neonode',
  'NIU',
  'Nokia',
  'Nothing',
  'Nvidia',
  'O2',
  'OnePlus',
  'Oppo',
  'Orange',
  'Palm',
  'Panasonic',
  'Pantech',
  'Parla',
  'Philips',
  'Plum',
  'Posh',
  'Prestigio',
  'QMobil

The pandas library provides ```DataFrame```, which accepts a dictionary of lists and returns a data frame with one column per key.

In [33]:
mobiles_df = pd.DataFrame(mobiles_dict) 

In [34]:
mobiles_df

| | Brands | Number_Of_Devices | Urls |
|---|---|---|---|
| 0 | Acer | 100 devices | https://www.gsmarena.com/acer-phones-59.php |
| 1 | alcatel | 407 devices | https://www.gsmarena.com/alcatel-phones-5.php |
| 2 | Allview | 157 devices | https://www.gsmarena.com/allview-phones-88.php |
| 3 | Amazon | 22 devices | https://www.gsmarena.com/amazon-phones-76.php |
| 4 | Amoi | 47 devices | https://www.gsmarena.com/amoi-phones-28.php |
| ... | ... | ... | ... |
| 114 | XOLO | 81 devices | https://www.gsmarena.com/xolo-phones-85.php |
| 115 | Yezz | 106 devices | https://www.gsmarena.com/yezz-phones-78.php |
| 116 | Yota | 3 devices | https://www.gsmarena.com/yota-phones-99.php |
| 117 | YU | 13 devices | https://www.gsmarena.com/yu-phones-100.php |


This gives a tidy table built from the lists, with the columns Brands, Number_Of_Devices, and Urls.

```to_csv``` writes the data frame to a CSV file. Passing ```index=False``` leaves the index column out of the file.

In [37]:
mobiles_df.to_csv('brands.csv', index=False)
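
To see what ends up in the file without running pandas, here is a minimal sketch using only the standard library ```csv``` module. It is not how ```to_csv``` is implemented; ```dict_of_lists_to_csv``` is a hypothetical helper and the two-brand dictionary is a shortened copy of the data above.

```python
import csv
import io

# A shortened version of the dictionary of lists built above.
mobiles_dict = {
    "Brands": ["Acer", "alcatel"],
    "Number_Of_Devices": ["100 devices", "407 devices"],
    "Urls": [
        "https://www.gsmarena.com/acer-phones-59.php",
        "https://www.gsmarena.com/alcatel-phones-5.php",
    ],
}


def dict_of_lists_to_csv(columns):
    """Write column lists as CSV text, one row per position, without an index."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns.keys())           # header row from the dictionary keys
    writer.writerows(zip(*columns.values()))  # zip turns the columns into rows
    return buffer.getvalue()


csv_text = dict_of_lists_to_csv(mobiles_dict)
print(csv_text)
```

The header row comes from the dictionary keys, and each following row takes one item from every list, which is the same layout as the index-free CSV written above.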

![](https://i.imgur.com/vhoKTFJ.png)

![](https://i.imgur.com/F6pljVN.png)

Part 1 of the problem is solved.

All of this code can be put into functions for reusability.

The code is split into the two functions described earlier, plus a small scrape_page() helper that downloads and parses a page.

In [3]:
def get_brands_dict(mobile_brand_tags):

    # lists that will become the columns
    Brands = []
    Devices = []
    Urls = []
    base_url = "https://www.gsmarena.com/"  # site root used to complete relative links

    for tag in mobile_brand_tags:
        Brands.append(tag.find('br').previous_sibling)  # brand name (text just before <br>)
        Devices.append(tag.find('span').text.strip())   # number of devices, e.g. '100 devices'
        Urls.append(base_url + tag.find('a')['href'])   # complete brand URL

    # create a dictionary from the extracted lists
    mobiles_dict = {'Brands': Brands, 'Number_Of_Devices': Devices, 'Urls': Urls}

    # return the dictionary
    return mobiles_dict

In [4]:
def scrape_page(url):

    # URL of the page to download
    mobiles_url = url

    # Download the page with the requests library, sending a custom User-Agent header
    response = requests.get(mobiles_url, headers={'User-agent': 'Super Bot Power Level Over 9000'})

    # Check whether the page loaded successfully
    if response.status_code != 200:
        raise Exception('Failed to load data {}'.format(mobiles_url))

    # Parse the response text with BeautifulSoup
    doc = BeautifulSoup(response.text, 'html.parser')

    # Return the parsed page
    return doc
    

In [5]:
def scrape_brands():
    mobiles_url = 'https://www.gsmarena.com/makers.php3'  # target URL

    doc = scrape_page(mobiles_url)  # download and parse the brands page
    mobile_brand_tags = doc.find_all('td')  # each td tag holds one brand

    mobiles_dict = get_brands_dict(mobile_brand_tags)  # extracted data as a dictionary of lists

    brands_df = pd.DataFrame(mobiles_dict)  # data frame built from the dictionary

    return brands_df

So calling scrape_brands() gives us a data frame of brands.

In [8]:
scrape_brands()

| | Brands | Number_Of_Devices | Urls |
|---|---|---|---|
| 0 | Acer | 100 devices | https://www.gsmarena.com/acer-phones-59.php |
| 1 | alcatel | 407 devices | https://www.gsmarena.com/alcatel-phones-5.php |
| 2 | Allview | 157 devices | https://www.gsmarena.com/allview-phones-88.php |
| 3 | Amazon | 22 devices | https://www.gsmarena.com/amazon-phones-76.php |
| 4 | Amoi | 47 devices | https://www.gsmarena.com/amoi-phones-28.php |
| ... | ... | ... | ... |
| 114 | XOLO | 81 devices | https://www.gsmarena.com/xolo-phones-85.php |
| 115 | Yezz | 106 devices | https://www.gsmarena.com/yezz-phones-78.php |
| 116 | Yota | 3 devices | https://www.gsmarena.com/yota-phones-99.php |
| 117 | YU | 13 devices | https://www.gsmarena.com/yu-phones-100.php |


## Second Part Of the Problem

Let us solve the ```Second Part``` of the problem.
Here we need to extract the details of the devices launched by each brand.
That includes ```Device Name, Device Launch Date, Device Features and Device URL```.

The input for the second part of the program is a brand URL from the ```Urls``` list.

Let's get the brand page.

In [43]:
brand_page = Urls[5] # for explanation getting the apple brand url

In [44]:
print(brand_page)

https://www.gsmarena.com/apple-phones-48.php


In [45]:
response = requests.get(brand_page)

In [46]:
response.status_code # if it gives a status code which is 200 - > Success

200

In [47]:
len(response.text)

35363

In [50]:
brand_doc = BeautifulSoup(response.text,'html.parser') # parsing file using beautiful soup

In [51]:
brand_doc


<!DOCTYPE html>

<html lang="en-US" xml:lang="en-US" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>All Apple phones</title>
<script>
DESKTOP_BASE_URL = "https://www.gsmarena.com/";
MOBILE_BASE_URL = "https://m.gsmarena.com/";
ASSETS_BASE_URL  = "https://fdn.gsmarena.com/vv/assets12/";
CDN_BASE_URL = "//fdn.gsmarena.com/";
CDN2_BASE_URL = "//fdn2.gsmarena.com/";
</script>
<meta charset="utf-8"/>
<meta content="width=1060, initial-scale=1.0" name="viewport"/>
<link href="https://fdn.gsmarena.com/vv/assets12/css/gsmarena.css?v=64" rel="stylesheet"/>
<link href="https://fdn.gsmarena.com/imgroot/static/favicon.ico" rel="shortcut icon"/>
<script>
window["pgGlobalSettings"] = {
    "global": {
        "strategy": "include"
    },
    "adUnits": []
};

function addAdUnit(adUnitCode) {
    window.pgGlobalSettings.adUnits.push({ "adUnitCode": adUnitCode });
};

	
	addAdUnit("/8095840,14566801/.2_A.34912.3_gsmarena.com_tier1");
	
	addAdUnit("/8095840,14566801/.2_A.35452.4_gsmarena.com_tier

Saving the file in html format

In [52]:
with open('Apple.html', 'w', encoding='utf-8') as f:  # explicit encoding avoids platform-dependent write errors
    f.write(response.text)

![](https://i.imgur.com/1jQECzy.png)

Image From The Website Downloaded Apple.html

![](https://i.imgur.com/e93IS7y.png)

![](https://i.imgur.com/ZZ7TO7c.png)

After inspecting the page, we find that all the device data sits inside a ```div``` tag with the class ```makers```.

In [53]:
brand_mobile_divison = brand_doc.find('div',class_='makers')

In [54]:
len(brand_mobile_divison)

5

In [55]:
brand_mobile_divison

<div class="makers">
<ul>
<li><a href="apple_iphone_14_pro_max-11773.php"><img src="https://fdn2.gsmarena.com/vv/bigpic/apple-iphone-14-pro-max-.jpg" title="Apple iPhone 14 Pro Max smartphone. Announced Sep 2022. Features 6.7″  display, Apple A16 Bionic chipset, 4323 mAh battery, 1024 GB storage, 6 GB RAM, Ceramic Shield glass."/><strong><span>iPhone 14 Pro Max</span></strong></a></li><li><a href="apple_iphone_14_pro-11860.php"><img src="https://fdn2.gsmarena.com/vv/bigpic/apple-iphone-14-pro.jpg" title="Apple iPhone 14 Pro smartphone. Announced Sep 2022. Features 6.1″  display, Apple A16 Bionic chipset, 3200 mAh battery, 1024 GB storage, 6 GB RAM, Ceramic Shield glass."/><strong><span>iPhone 14 Pro</span></strong></a></li><li><a href="apple_iphone_14_plus-11862.php"><img src="https://fdn2.gsmarena.com/vv/bigpic/apple-iphone-14-plus.jpg" title="Apple iPhone 14 Plus smartphone. Announced Sep 2022. Features 6.7″  display, Apple A15 Bionic chipset, 4323 mAh battery, 512 GB storage, 6 GB R

In [56]:
brand_mobile_tags = brand_mobile_divison.find_all('li')

In [57]:
len(brand_mobile_tags)

40

The page lists 40 Apple devices, starting with the most recent launches.

In [58]:
first_mobile=brand_mobile_tags[0] #first device under a brand
first_mobile

<li><a href="apple_iphone_14_pro_max-11773.php"><img src="https://fdn2.gsmarena.com/vv/bigpic/apple-iphone-14-pro-max-.jpg" title="Apple iPhone 14 Pro Max smartphone. Announced Sep 2022. Features 6.7″  display, Apple A16 Bionic chipset, 4323 mAh battery, 1024 GB storage, 6 GB RAM, Ceramic Shield glass."/><strong><span>iPhone 14 Pro Max</span></strong></a></li>

It looks like the ```li``` tag holds rich data.

Inside each ```li``` tag there is an ```img``` tag whose ```title``` attribute contains the device name, launch date, and features. The ```a``` tag holds the device URL.

In [59]:
data = first_mobile.find('img')["title"]
data

'Apple iPhone 14 Pro Max smartphone. Announced Sep 2022. Features 6.7″  display, Apple A16 Bionic chipset, 4323 mAh battery, 1024 GB storage, 6 GB RAM, Ceramic Shield glass.'

The string ```split``` method breaks the title into pieces at each ```.``` character.

In [60]:
data_list=data.split(".")
data_list

['Apple iPhone 14 Pro Max smartphone',
 ' Announced Sep 2022',
 ' Features 6',
 '7″  display, Apple A16 Bionic chipset, 4323 mAh battery, 1024 GB storage, 6 GB RAM, Ceramic Shield glass',
 '']

In [61]:
first_mobile_name = data_list[0]
first_mobile_name

'Apple iPhone 14 Pro Max smartphone'

We have extracted the device name: 'Apple iPhone 14 Pro Max smartphone'.

In [62]:
first_mobile_launch_date = data_list[1]
first_mobile_launch_date

' Announced Sep 2022'

We have extracted the date on which the device was announced.
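
The extracted value is still free text with a leading space. If a sortable date is wanted, a minimal sketch of the conversion could look like this. ```parse_announced``` is a hypothetical helper, and it assumes the text has the shape seen above (a month abbreviation followed by a year); anything else yields ```None```.

```python
MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def parse_announced(text):
    """Return (year, month) from text like ' Announced Sep 2022', or None."""
    words = text.replace("Announced", "").split()
    if len(words) == 2 and words[0] in MONTHS and words[1].isdigit():
        return int(words[1]), MONTHS[words[0]]
    return None  # unexpected format: leave it to the caller to decide


print(parse_announced(' Announced Sep 2022'))
```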

The ```a``` tag's ```href``` attribute holds the URL of the device page.

In [63]:
first_mobile_url = first_mobile.find('a')['href']

In [64]:
first_mobile_url

'apple_iphone_14_pro_max-11773.php'

We have the relative URL for the first device. Prefixing it with the base URL gives the complete URL.

The remaining pieces of ```data_list```, after the launch date, describe the features, so we join them back together.

In [66]:
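tech = ""

# Re-join the pieces after the launch date. The "." lost between
# data_list[2] and data_list[3] (the decimal point in "6.7") is restored.
for i in range(2, len(data_list)):
    if i == 3:
        tech = tech + '.'
    tech = tech + data_list[i]

tech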

' Features 6.7″  display, Apple A16 Bionic chipset, 4323 mAh battery, 1024 GB storage, 6 GB RAM, Ceramic Shield glass'

We now have the features of ```first_mobile```.
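
Splitting on every ```.``` also cuts through decimal numbers, which is why the decimal point has to be put back afterwards. As an alternative sketch (not the method used in this notebook), splitting on a period followed by a space, at most twice, keeps the decimals intact. ```split_title``` is a hypothetical helper and assumes the three-sentence shape of the title seen above.

```python
def split_title(title):
    """Split an img title into (name, announced, features) on '. ' separators."""
    parts = title.strip().split(". ", 2)   # at most three pieces
    parts += [""] * (3 - len(parts))       # pad if a sentence is missing
    name, announced, features = (p.strip().rstrip(".") for p in parts)
    return name, announced, features


title = ('Apple iPhone 14 Pro Max smartphone. Announced Sep 2022. '
         'Features 6.7″  display, Apple A16 Bionic chipset, 4323 mAh battery, '
         '1024 GB storage, 6 GB RAM, Ceramic Shield glass.')
name, announced, features = split_title(title)
print(name)
print(announced)
print(features)
```

Because "6.7" has no space after its period, it is never treated as a sentence boundary.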

Now we put all of these steps into one ```for``` loop.

In [67]:
# Lists for collecting the data
name = []
launch_date = []
features = []
mobile_urls = []

for mobile in brand_mobile_tags:
    data = mobile.find('img')["title"]  # title text containing name, date and features
    link = "https://www.gsmarena.com/" + mobile.find('a')["href"]  # complete device URL
    mobile_urls.append(link)  # add the device URL to mobile_urls
    l = data.split(".")  # split the title into pieces at each "."
    name.append(l[0])  # add the device name to the name list
    launch_date.append(l[1])  # add the launch date to the launch_date list

    tech = ""

    # re-join the feature pieces, restoring the decimal point between l[2] and l[3]
    for i in range(2, len(l)):
        if i == 3:
            tech = tech + '.'
        tech = tech + l[i]

    features.append(tech)  # add the features to the features list


In [68]:
name[:10] #lets get for first 10 mobiles

['Apple iPhone 14 Pro Max smartphone',
 'Apple iPhone 14 Pro smartphone',
 'Apple iPhone 14 Plus smartphone',
 'Apple iPhone 14 smartphone',
 'Apple Watch Ultra watch',
 'Apple Watch Series 8 watch',
 'Apple Watch Series 8 Aluminum watch',
 'Apple Watch SE (2022) watch',
 'Apple iPhone SE (2022) smartphone',
 'Apple iPad Air (2022) tablet']

In [69]:
launch_date[:10] # launch date of first 10 mobiles

[' Announced Sep 2022',
 ' Announced Sep 2022',
 ' Announced Sep 2022',
 ' Announced Sep 2022',
 ' Announced Sep 2022',
 ' Announced Sep 2022',
 ' Announced Sep 2022',
 ' Announced Sep 2022',
 ' Announced Mar 2022',
 ' Announced Mar 2022']

In [70]:
features[:10] # features of first 10 mobiles

[' Features 6.7″  display, Apple A16 Bionic chipset, 4323 mAh battery, 1024 GB storage, 6 GB RAM, Ceramic Shield glass',
 ' Features 6.1″  display, Apple A16 Bionic chipset, 3200 mAh battery, 1024 GB storage, 6 GB RAM, Ceramic Shield glass',
 ' Features 6.7″  display, Apple A15 Bionic chipset, 4323 mAh battery, 512 GB storage, 6 GB RAM, Ceramic Shield glass',
 ' Features 6.1″  display, Apple A15 Bionic chipset, 3279 mAh battery, 512 GB storage, 6 GB RAM, Ceramic Shield glass',
 ' Features 1.92″  display, Apple S8 chipset, 542 mAh battery, 32 GB storage, MIL-STD 810H certified, Sapphire crystal glass',
 ' Features 1.9″  display, Apple S8 chipset, 308 mAh battery, 32 GB storage, 1000 MB RAM, IP6X certified, Sapphire crystal glass',
 ' Features 1.9″  display, Apple S8 chipset, 308 mAh battery, 32 GB storage, 1000 MB RAM, IP6X certified, Ion-X strengthened glass',
 ' Features 1.78″  display, Apple S8 chipset, 296 mAh battery, 32 GB storage, 1000 MB RAM, Ion-X strengthened glass',
 ' Featur

In [71]:
mobile_urls[:10] # Device_URL'S of first 10 mobiles

['https://www.gsmarena.com/apple_iphone_14_pro_max-11773.php',
 'https://www.gsmarena.com/apple_iphone_14_pro-11860.php',
 'https://www.gsmarena.com/apple_iphone_14_plus-11862.php',
 'https://www.gsmarena.com/apple_iphone_14-11861.php',
 'https://www.gsmarena.com/apple_watch_ultra-11827.php',
 'https://www.gsmarena.com/apple_watch_series_8-11866.php',
 'https://www.gsmarena.com/apple_watch_series_8_aluminum-11864.php',
 'https://www.gsmarena.com/apple_watch_se_(2022)-11865.php',
 'https://www.gsmarena.com/apple_iphone_se_(2022)-11410.php',
 'https://www.gsmarena.com/apple_ipad_air_(2022)-11411.php']

Let's create a dictionary for the brand from the lists created above.

In [72]:
brand_dict = {
    'Device_Names':name,
    'Announced_Date':launch_date,
    'Features':features,
    'Device_Url':mobile_urls
}

In [73]:
brand_df = pd.DataFrame(brand_dict) # Converting into data frame from dictionary

In [74]:
brand_df

| | Device_Names | Announced_Date | Features | Device_Url |
|---|---|---|---|---|
| 0 | Apple iPhone 14 Pro Max smartphone | Announced Sep 2022 | Features 6.7″ display, Apple A16 Bionic chip... | https://www.gsmarena.com/apple_iphone_14_pro_m... |
| 1 | Apple iPhone 14 Pro smartphone | Announced Sep 2022 | Features 6.1″ display, Apple A16 Bionic chip... | https://www.gsmarena.com/apple_iphone_14_pro-1... |
| 2 | Apple iPhone 14 Plus smartphone | Announced Sep 2022 | Features 6.7″ display, Apple A15 Bionic chip... | https://www.gsmarena.com/apple_iphone_14_plus-... |
| 3 | Apple iPhone 14 smartphone | Announced Sep 2022 | Features 6.1″ display, Apple A15 Bionic chip... | https://www.gsmarena.com/apple_iphone_14-11861... |
| 4 | Apple Watch Ultra watch | Announced Sep 2022 | Features 1.92″ display, Apple S8 chipset, 54... | https://www.gsmarena.com/apple_watch_ultra-118... |
| 5 | Apple Watch Series 8 watch | Announced Sep 2022 | Features 1.9″ display, Apple S8 chipset, 308... | https://www.gsmarena.com/apple_watch_series_8-... |
| 6 | Apple Watch Series 8 Aluminum watch | Announced Sep 2022 | Features 1.9″ display, Apple S8 chipset, 308... | https://www.gsmarena.com/apple_watch_series_8_... |
| 7 | Apple Watch SE (2022) watch | Announced Sep 2022 | Features 1.78″ display, Apple S8 chipset, 29... | https://www.gsmarena.com/apple_watch_se_(2022)... |
| 8 | Apple iPhone SE (2022) smartphone | Announced Mar 2022 | Features 4.7″ display, Apple A15 Bionic chip... | https://www.gsmarena.com/apple_iphone_se_(2022... |
| 9 | Apple iPad Air (2022) tablet | Announced Mar 2022 | Features 10.9″ display, Apple M1 chipset, 25... | https://www.gsmarena.com/apple_ipad_air_(2022)... |


Now let's turn the code above into generic, reusable functions.

- The function get_mobiles_dict extracts the useful data.
- The input for the function is brand_mobile_tags.
- It returns a dictionary holding the lists of
   names, dates, features, and URLs.

In [6]:
def get_mobiles_dict(brand_mobile_tags):
    # Lists for collecting the data
    name = []
    launch_date = []
    features = []
    mobile_urls = []

    for mobile in brand_mobile_tags:

        data = mobile.find('img')["title"]  # title text containing name, date and features
        link = "https://www.gsmarena.com/" + mobile.find('a')["href"]  # complete device URL

        mobile_urls.append(link)  # add the complete URL
        l = data.split(".")  # split the title into pieces at each "."

        name.append(l[0])  # device name

        launch_date.append(l[1])  # announcement date

        tech = ""

        # re-join the feature pieces, restoring the decimal point between l[2] and l[3]
        for i in range(2, len(l)):
            if i == 3:
                tech = tech + '.'
            tech = tech + l[i]

        features.append(tech)  # add the features to the features list

    # create a dictionary from the lists
    mobiles_dict = {'Device_Names': name, 'Announced_Date': launch_date, 'Features': features, 'Device_Url': mobile_urls}

    return mobiles_dict
        


The scrape_brand_mobiles() function takes a brand URL and returns the corresponding data frame.

In [7]:
def scrape_brand_mobiles(brand_page):
    # download the brand's web page
    response = requests.get(brand_page, headers={'User-agent': 'Super Bot Power Level Over 9000'})

    if response.status_code != 200:  # status code 200 means the page was downloaded successfully
        raise Exception('Failed to load data {}'.format(brand_page))  # otherwise raise an exception

    brand_doc = BeautifulSoup(response.text, 'html.parser')  # parse the web page for data extraction
    brand_mobile_divison = brand_doc.find('div', class_='makers')  # div that contains the device list
    brand_mobile_tags = brand_mobile_divison.find_all('li')  # one li tag per device

    mobiles_dict = get_mobiles_dict(brand_mobile_tags)  # dictionary holding the device data
    mobiles_df = pd.DataFrame(mobiles_dict)  # create a data frame from the dictionary
    return mobiles_df
    
 

So calling scrape_brand_mobiles returns a data frame of the latest launched devices for that brand.

In [11]:
Apple ="https://www.gsmarena.com/apple-phones-48.php" #apple phone url
Apple

'https://www.gsmarena.com/apple-phones-48.php'

In [12]:
scrape_brand_mobiles(Apple)

| | Device_Names | Announced_Date | Features | Device_Url |
|---|---|---|---|---|
| 0 | Apple iPhone 14 Pro Max smartphone | Announced Sep 2022 | Features 6.7″ display, Apple A16 Bionic chip... | https://www.gsmarena.com/apple_iphone_14_pro_m... |
| 1 | Apple iPhone 14 Pro smartphone | Announced Sep 2022 | Features 6.1″ display, Apple A16 Bionic chip... | https://www.gsmarena.com/apple_iphone_14_pro-1... |
| 2 | Apple iPhone 14 Plus smartphone | Announced Sep 2022 | Features 6.7″ display, Apple A15 Bionic chip... | https://www.gsmarena.com/apple_iphone_14_plus-... |
| 3 | Apple iPhone 14 smartphone | Announced Sep 2022 | Features 6.1″ display, Apple A15 Bionic chip... | https://www.gsmarena.com/apple_iphone_14-11861... |
| 4 | Apple Watch Ultra watch | Announced Sep 2022 | Features 1.92″ display, Apple S8 chipset, 54... | https://www.gsmarena.com/apple_watch_ultra-118... |
| 5 | Apple Watch Series 8 watch | Announced Sep 2022 | Features 1.9″ display, Apple S8 chipset, 308... | https://www.gsmarena.com/apple_watch_series_8-... |
| 6 | Apple Watch Series 8 Aluminum watch | Announced Sep 2022 | Features 1.9″ display, Apple S8 chipset, 308... | https://www.gsmarena.com/apple_watch_series_8_... |
| 7 | Apple Watch SE (2022) watch | Announced Sep 2022 | Features 1.78″ display, Apple S8 chipset, 29... | https://www.gsmarena.com/apple_watch_se_(2022)... |
| 8 | Apple iPhone SE (2022) smartphone | Announced Mar 2022 | Features 4.7″ display, Apple A15 Bionic chip... | https://www.gsmarena.com/apple_iphone_se_(2022... |
| 9 | Apple iPad Air (2022) tablet | Announced Mar 2022 | Features 10.9″ display, Apple M1 chipset, 25... | https://www.gsmarena.com/apple_ipad_air_(2022)... |


In [8]:

def scrape_brand_to_csv(brand_url, path):  # creates a CSV at the given path from the given brand URL

    if os.path.exists(path):  # check whether the file already exists
        print("The file {} already exists. skipping .... ".format(path))
        return
    mobiles_df = scrape_brand_mobiles(brand_url)  # data frame for the given brand URL
    mobiles_df.to_csv(path, index=False)  # write the CSV without the index column
    

Calling scrape_brand_to_csv with a brand URL and a destination path scrapes the brand and saves the data frame as a CSV file at that path. If the file already exists, the brand is skipped.
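
The skip-if-exists check is what makes it safe to re-run the scraper after an interruption: files already written are not downloaded again. The sketch below isolates that pattern with the standard library only. ```write_once``` is a hypothetical helper that writes plain text where the notebook writes a data frame, and the temporary folder is used only to keep the example self-contained.

```python
import os
import tempfile


def write_once(path, text):
    """Write text to path unless the file already exists; report what happened."""
    if os.path.exists(path):
        return "skipped"
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return "written"


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "apple.csv")
    first = write_once(target, "Device_Names\n")   # file is created
    second = write_once(target, "Device_Names\n")  # file exists, so nothing is written

print(first, second)
```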

In [55]:
scrape_brand_to_csv(Apple,"apple-check10.csv")

The file apple-check10.csv already exists. skipping .... 


apple-check.csv in files tab

![](https://i.imgur.com/OtOk1aT.png)

view of apple-check  CSV file

![](https://i.imgur.com/GOhb0M3.png)

## Mega Driver Function
Now we create the driver function, which uses all the functions above and creates one CSV file of device data for each brand.

In [9]:
def scrape_GSMArena():  # driver function that uses all the functions above and saves all CSV files
    brands_df = scrape_brands()  # data frame of all brands
    os.makedirs('data', exist_ok=True)  # create the data folder if it does not exist
    print('-----------Scraping Started---------------')
    for index, row in brands_df.iterrows():  # each row supplies one brand name and URL
        print('Scraping {} Brand from GSMArena'.format(row['Brands']))
        scrape_brand_to_csv(row['Urls'], 'data/{}.csv'.format(row['Brands']))
    print('-----------Scraping done Successful---------------')

In [10]:
scrape_GSMArena()

-----------Scraping Started---------------
Scraping Acer Brand from GSMArena
Scraping alcatel Brand from GSMArena
Scraping Allview Brand from GSMArena
Scraping Amazon Brand from GSMArena
Scraping Amoi Brand from GSMArena
Scraping Apple Brand from GSMArena
Scraping Archos Brand from GSMArena
Scraping Asus Brand from GSMArena
Scraping AT&T Brand from GSMArena
Scraping Benefon Brand from GSMArena
Scraping BenQ Brand from GSMArena
Scraping BenQ-Siemens Brand from GSMArena
Scraping Bird Brand from GSMArena
Scraping BlackBerry Brand from GSMArena
Scraping Blackview Brand from GSMArena
Scraping BLU Brand from GSMArena
Scraping Bosch Brand from GSMArena
Scraping BQ Brand from GSMArena
Scraping Casio Brand from GSMArena
Scraping Cat Brand from GSMArena
Scraping Celkon Brand from GSMArena
Scraping Chea Brand from GSMArena
Scraping Coolpad Brand from GSMArena
Scraping Dell Brand from GSMArena
Scraping Doogee Brand from GSMArena
Scraping Emporia Brand from GSMArena
Scraping Energizer Brand from GS

A folder named ```data``` is created. It contains 119 CSV files, one per brand, each holding that brand's latest launched devices.
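
The driver uses the brand name directly as the file name. That worked here, but names containing characters such as ```&``` or spaces can be awkward on some systems. As an optional, hedged sketch, a hypothetical ```safe_filename``` helper could normalise the name first. The set of allowed characters below is an assumption, not something GSMArena or pandas requires.

```python
import re


def safe_filename(brand_name):
    """Replace runs of characters outside letters, digits, '.', '_' and '-' with '_'."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", brand_name.strip())
    return cleaned.strip("_") or "unnamed"  # fall back if nothing usable is left


for brand in ["AT&T", "Fujitsu Siemens", "BenQ-Siemens"]:
    print("data/{}.csv".format(safe_filename(brand)))
```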

![](https://i.imgur.com/U7757Dd.png)

![](https://i.imgur.com/lqyBBjz.png)

## Saving Files In Data Folder at Files Section

In [62]:
import jovian
jovian.commit(files = ["data/"])

<IPython.core.display.Javascript object>

[jovian] Updating notebook "yallabalaji1/gsmarena-mobile-devices-scraping" on https://jovian.ai
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/yallabalaji1/gsmarena-mobile-devices-scraping


'https://jovian.ai/yallabalaji1/gsmarena-mobile-devices-scraping'

# Summary:

To summarise, in this project, "WebScraping Latest Mobile-Devices from GSMArena", we covered the following:

1. Download the web page using requests.
2. Parse the HTML source code using Beautiful Soup.
3. Extract data such as ```Brand Name```, ```Number Of Devices Launched```, and ```Brand URL``` for every brand.

4. Compile this information in lists and dictionaries using Python, and write it to a CSV file.

5. Extract the device data for each brand using the brand URL from the previous list.

6. Write the extracted information into a CSV file in the format:

```device_name```,```launched_date```,```features```,```device_url```

7. In the end we found 119 brands listed on GSMArena and collected the details of the latest launched devices for each of them. In total we got 120 CSV files.





# References:

1. Web Scraping Guide [Jovian](https://jovian.ai/learn/zero-to-data-analyst-bootcamp)

2. For errors and troubleshooting: [Stack Overflow](https://stackoverflow.com)

3. Beautiful Soup [BS4](https://beautiful-soup-4.readthedocs.io/en/latest/)


# Future Work:

1. The feature list can be broken down further into separate categories (display, chipset, battery, and so on) so that the data can be filtered.

2. We can also get more data by visiting every device page of each brand and collecting the full specifications.

3. By taking the reviews on each device page into account, we could recommend the best brand and the best device for a given set of specifications.
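
As a first step toward the categorisation idea in point 1, the sketch below pulls a few numeric fields out of a features string with regular expressions. It is an illustration only: ```extract_specs``` is a hypothetical helper, the patterns are assumptions based on the sample strings shown earlier, and a field that does not match (for example RAM given in MB) is returned as ```None```.

```python
import re

# Patterns are guesses based on the sample feature strings in this notebook.
PATTERNS = {
    "display_inches": r"([\d.]+)\s*″",
    "battery_mah": r"(\d+)\s*mAh",
    "storage_gb": r"(\d+)\s*GB storage",
    "ram_gb": r"(\d+)\s*GB RAM",
}


def extract_specs(features):
    """Return a dict of numeric specs found in a features string (None if absent)."""
    specs = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, features)
        specs[key] = float(match.group(1)) if match else None
    return specs


sample = (' Features 6.7″  display, Apple A16 Bionic chipset, 4323 mAh battery, '
          '1024 GB storage, 6 GB RAM, Ceramic Shield glass')
specs = extract_specs(sample)
print(specs)
```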
