# **Retail Store Location Scraper** 
## **Store - V-Mart**


# **Importing required libraries**

**requests**: This library is used to make HTTP requests to a website and retrieve the HTML content of a web page.


**pandas**: This library is used for data manipulation and analysis. It provides data structures for efficiently storing and analyzing data, as well as tools for importing and exporting data from various file formats.


**re**: This is the built-in regular expression library in Python. It provides functions for working with regular expressions, which are patterns used to match and manipulate text.


**BeautifulSoup**: This library is used for parsing HTML and XML documents. It provides tools for navigating and searching the structure of an HTML/XML document, and extracting specific elements and data from it.

In [1]:
import requests
import pandas as pd
import re
from bs4 import BeautifulSoup

# Send a GET request to the website we want to scrape
# The V-Mart store locator is used for this project
url = 'https://stores.vmartretail.com/'
response = requests.get(url)

print(response)

<Response [200]>


# **Request status**
A 200 status code means the server received, processed, and completed the request successfully. It is the status code you will see most often when making HTTP requests.
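
As a small illustration of how status codes map to meanings, the sketch below uses the standard library's `http.HTTPStatus` enum. The helper name `describe_status` is hypothetical and is not part of this notebook.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short, human-readable description of an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    if 200 <= code < 300:
        outcome = "success"
    elif 400 <= code < 500:
        outcome = "client error"
    elif code >= 500:
        outcome = "server error"
    else:
        outcome = "informational or redirect"
    return f"{code} {status.phrase} ({outcome})"


print(describe_status(200))
print(describe_status(404))
```

In the notebook itself, calling `response.raise_for_status()` right after `requests.get(url)` raises an exception for any 4xx or 5xx response, so later cells never try to parse an error page.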

**Creating a DataFrame df to store our data**

In [2]:
df = pd.DataFrame(columns=["address", "area", "contact_no", "coordinates", "timing",'directions'])


# **Address**
**find_all** is a method of the soup object that searches the HTML document for all <span> elements with a class attribute of "**store-address-info idx-info-card-str-add-info**".

The **for** loop iterates over each <span> element in the list of spans returned by find_all. For each element, the text attribute is extracted and stored in the variable address.

A new Pandas DataFrame is created with a single column called address containing the extracted text, and this DataFrame is concatenated with an existing DataFrame called df using the pd.concat function.

The ignore_index=True argument is used to reset the index of the concatenated DataFrame.

In [3]:

# Parse the HTML content of the page using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

spans = soup.find_all("span", class_="store-address-info idx-info-card-str-add-info")

# loop through the list of <span> elements and add the text to the DataFrame
for span in spans:
    address = span.text.strip()
    df = pd.concat([df, pd.DataFrame({"address": [address]})], ignore_index=True)
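
Calling `pd.concat` inside a loop copies the whole DataFrame on every iteration. A possible alternative, sketched here on a made-up HTML snippet rather than the live page, is to collect the text in a list and build the DataFrame once. The names `sample_html`, `sample_soup`, and `address_df` are illustrative only.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up markup that imitates the class names used on the store page
sample_html = """
<div><span class="store-address-info idx-info-card-str-add-info"> 12 Example Road, Delhi </span></div>
<div><span class="store-address-info idx-info-card-str-add-info">34 Sample Street, Noida</span></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Collect the stripped text of every matching <span> in document order
addresses = [
    span.get_text(strip=True)
    for span in sample_soup.find_all(
        "span", class_="store-address-info idx-info-card-str-add-info"
    )
]

# Build the DataFrame in a single step instead of concatenating row by row
address_df = pd.DataFrame({"address": addresses})
print(address_df)
```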


# **Area**
**find_all** is a method of the soup object that searches the HTML document for all **h2** elements with a class attribute of "**idx-info-card-str-name mb-0**".


The **if not spans:** statement checks if any matching **h2** elements were found. If no matches were found, an exception is raised.

The for loop iterates over each **h2** element in the list returned by **find_all**. For each element, the text is extracted and stored in the variable area. The enumerate function provides an index for each element, and the at accessor writes the area value to the row with that index. This assumes the headings appear on the page in the same order as the addresses.

The **if df.empty:** statement checks if any data was added to the DataFrame. If no data was added, an exception is raised.

The **str.replace** method is used to remove the string "V-Mart " from each value in the "area" column of the DataFrame.

In [4]:
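spans = soup.find_all("h2", class_="idx-info-card-str-name mb-0")

if not spans:
    raise Exception("No matching h2 elements found")

# Write each store name into the row with the same position as its address
for i, span in enumerate(spans):
    area = span.text.strip()
    df.at[i, "area"] = area

if df.empty:
    raise Exception("No data added to DataFrame")

# Remove the literal "V-Mart " prefix so only the area name remains
df['area'] = df['area'].str.replace('V-Mart ', '', regex=False)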
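
Because each field is scraped separately and matched to rows by position, a missing element on one store card would shift every later value into the wrong row. A minimal sketch of a guard against that, using a hypothetical helper `check_aligned` and made-up sample lists:

```python
def check_aligned(**columns):
    """Raise ValueError if the named lists do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    # Return the shared length (0 when no columns were passed)
    return next(iter(lengths.values()), 0)


# Made-up values standing in for the scraped lists
addresses = ["12 Example Road", "34 Sample Street"]
areas = ["Lajpat Nagar", "Laxmi Nagar"]

row_count = check_aligned(address=addresses, area=areas)
print(row_count)
```

Running a check like this before assigning columns turns a silent misalignment into an explicit error.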



# **Contact number**
This code uses the BeautifulSoup library to **extract all anchor tags** (a) whose **href attribute starts with tel:**. It then loops through each anchor tag, takes the part of the href after the tel: prefix, decodes any percent-encoded characters with the unquote() function, and appends the phone number to a new DataFrame c.

Finally, the phone numbers in c are added to an existing DataFrame df as a new column named "**contact_no**".

In [5]:
from urllib.parse import unquote
tags = soup.find_all("a", href=lambda href: href and href.startswith("tel:"))

# Extract the phone number from each tag's href attribute and append it to the DataFrame
c = pd.DataFrame(columns=[ "contact_no"])
for tag in tags:
    phone_number = unquote(tag['href'].split(":")[1])
    c = pd.concat([c, pd.DataFrame({"contact_no": [phone_number]})], ignore_index=True)

df["contact_no"] = c["contact_no"]
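
To see what `unquote` does to a `tel:` link in isolation, here is a small standard-library sketch. The helper `phone_from_href` and the sample href are assumptions made for illustration; the real page may encode its numbers differently.

```python
from urllib.parse import unquote


def phone_from_href(href):
    """Return the decoded number from a tel: link, or None for any other link."""
    if not href or not href.startswith("tel:"):
        return None
    # Split only on the first colon so the rest of the value stays intact
    return unquote(href.split(":", 1)[1]).strip()


# %20 is the percent-encoded form of a space
print(phone_from_href("tel:0844%208982%20285"))
print(phone_from_href("https://example.com"))
```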

# **Timing**
This code uses the BeautifulSoup library to **extract all span tags** (span) that have a **class name** of "**store-time-info idx-info-card-str-time-info**". 

Then it **loops** through each span tag, **extracts** the time information by calling **strip()** on the tag's text, and appends it to a new DataFrame t. Finally, the time information in t is added to the existing DataFrame df as the column "**timing**".

In [6]:

spans = soup.find_all("span", class_="store-time-info idx-info-card-str-time-info")
t = pd.DataFrame(columns=[ "timing"])

# loop through the list of <span> elements and add the text to the DataFrame
for span in spans:
    timing = span.text.strip()
    t = pd.concat([t, pd.DataFrame({"timing": [timing]})],ignore_index=True)

df["timing"] = t["timing"]
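
The scraped timing text in this notebook's output carries a trailing live status such as "(Closing Soon)" after the opening hours. If only the hours are wanted, one option is to strip a trailing parenthesised note with a regular expression. This is a sketch, and `strip_status` is a hypothetical helper rather than part of the original code.

```python
import re


def strip_status(timing):
    """Remove one trailing parenthesised note, e.g. '(Closing Soon)'."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", timing).strip()


print(strip_status("10:00 AM to 10:00 PM (Closing Soon)"))
```

Applied to the DataFrame, this could be written as `df["timing"].apply(strip_status)`.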

In [7]:
df.head()

Unnamed: 0,address,area,contact_no,coordinates,timing,directions
0,"No D 3, Block 1, Central Market 2, Lajpat Naga...","Lajpat Nagar, South East Delhi",7872478724,,10:00 AM to 10:00 PM (Closing Soon),
1,"Shop No 610 And 611, Ravi Das Marg, Main Bazar...","Laxmi Nagar, New Delhi",0844 8982 285,,10:00 AM to 10:00 PM (Closing Soon),
2,"C 7, Jyoti Nagar West, North Chajjupur, Block ...","Shahdara, New Delhi",0844 8982 285,,10:00 AM to 10:00 PM (Closing Soon),
3,"No E/561/A & E/561B, Palam Extension, Dwarka, ...","Dwarka, New Delhi",0931 3699 995,,10:00 AM to 10:00 PM (Closing Soon),
4,"Virendar Nagar, Block B, Sant Nagar, 100 feet ...","Burari, North Delhi",0931 3699 995,,10:00 AM to 10:00 PM (Closing Soon),


# **Location Links**
This code **extracts** the **href** values from **anchor tags** that have a **child i tag** with a **specific class attribute**.

It first uses the **find_all() method** to **find all i tags** with the **class attribute "flaticon-location-arrow align-middle map-icn mr-1"**.

Then it **creates an empty list to store the href values**.

It **loops** through the **i tags** and **gets** the **parent anchor tag**. **If the parent anchor tag has an href attribute**, the **href value is added to the list**.

Finally, a Pandas DataFrame is created with the href values and printed.

In [8]:
tags = soup.find_all("i", {"class": "flaticon-location-arrow align-middle map-icn mr-1"})

# Creating an empty list to store the href values
href_list = []

for tag in tags:
    # Getting the parent anchor tag of the <i> tag
    parent_tag = tag.parent
    
    # Checking if the href attribute exists in the parent anchor tag
    if "href" in parent_tag.attrs:
        # Append the href value to the list
        href_list.append(parent_tag["href"])

# Creating a Pandas DataFrame with the href values
location = pd.DataFrame({"href": href_list})

# Print the DataFrame
print(location)

                                                 href
0   https://www.google.com/maps/dir/?api=1&origin=...
1   https://www.google.com/maps/dir/?api=1&origin=...
2   https://www.google.com/maps/dir/?api=1&origin=...
3   https://www.google.com/maps/dir/?api=1&origin=...
4   https://www.google.com/maps/dir/?api=1&origin=...
5   https://www.google.com/maps/dir/?api=1&origin=...
6   https://www.google.com/maps/dir/?api=1&origin=...
7   https://www.google.com/maps/dir/?api=1&origin=...
8   https://www.google.com/maps/dir/?api=1&origin=...
9   https://www.google.com/maps/dir/?api=1&origin=...
10  https://www.google.com/maps/dir/?api=1&origin=...
11  https://www.google.com/maps/dir/?api=1&origin=...
12  https://www.google.com/maps/dir/?api=1&origin=...
13  https://www.google.com/maps/dir/?api=1&origin=...
14  https://www.google.com/maps/dir/?api=1&origin=...
15  https://www.google.com/maps/dir/?api=1&origin=...
16  https://www.google.com/maps/dir/?api=1&origin=...
17  https://www.google.com/m
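
The loop above skips any icon whose parent has no href, which would make the list shorter than the number of stores. A possible variation, shown on made-up markup, uses `find_parent("a")` to look for the nearest enclosing anchor and records `None` when there is none, so each icon still produces exactly one entry. The sample HTML and URL are illustrative assumptions.

```python
from bs4 import BeautifulSoup

icon_class = "flaticon-location-arrow align-middle map-icn mr-1"

# Made-up markup: one icon inside a link, one icon without a link
sample_html = f"""
<a href="https://example.com/directions/1"><i class="{icon_class}"></i> Map</a>
<span><i class="{icon_class}"></i> No link here</span>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

hrefs = []
for icon in sample_soup.find_all("i", class_=icon_class):
    # Nearest <a> ancestor, or None if the icon is not inside a link
    anchor = icon.find_parent("a")
    if anchor is not None and anchor.has_attr("href"):
        hrefs.append(anchor["href"])
    else:
        hrefs.append(None)  # keep a placeholder so positions still line up

print(hrefs)
```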

# **Storing the links in another column**

In [9]:
df['directions'] = location['href']

# **Coordinates**
This code **defines** a **regular expression pattern** to extract latitude and longitude values from a link.

It then **defines a function**, **extract_coords()**, that takes a link and **uses the regular expression pattern to extract the latitude and longitude values**.

If a match is found, the function returns a string of the **latitude and longitude separated by a comma**.

**Otherwise, it returns None**.

Finally, the **apply()** method is used to apply the **extract_coords()** function to the href column of the location DataFrame to extract the coordinates.

The **href column** of location is overwritten and now contains a string of the **latitude and longitude values for each location**. The original links remain available in the directions column of df.

In [10]:
# Define a regular expression pattern to extract the latitude and longitude values
pattern = r"destination=([\d\.]+),([\d\.]+)"

# Define a function to extract the latitude and longitude values from a link
def extract_coords(link):
    match = re.search(pattern, link)
    if match:
        latitude = match.group(1)
        longitude = match.group(2)
        return f"{latitude},{longitude}"
    else:
        return None

# Applying the extract_coords function to the href column to extract the coordinates
location["href"] = location["href"].apply(extract_coords)

In [11]:
df['coordinates'] = location['href']
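
The pattern above matches only digits and dots, so it would not match a negative coordinate or a percent-encoded comma. As an alternative sketch (a different technique from the notebook's regular expression), the standard library's URL parser can read the `destination` query parameter directly. The function name `coords_from_link` and the sample URLs are assumptions for illustration.

```python
from urllib.parse import urlparse, parse_qs


def coords_from_link(link):
    """Return (latitude, longitude) as floats from a 'destination' query
    parameter, or None if it is missing or malformed."""
    query = parse_qs(urlparse(link).query)  # also decodes %2C to a comma
    values = query.get("destination")
    if not values:
        return None
    parts = values[0].split(",")
    if len(parts) != 2:
        return None
    try:
        return float(parts[0]), float(parts[1])
    except ValueError:
        return None


# Made-up link in the same general shape as the scraped ones
sample = "https://www.google.com/maps/dir/?api=1&destination=28.568548,77.242454"
print(coords_from_link(sample))
```

Returning floats instead of a single string makes the values directly usable for distance calculations or plotting.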

In [12]:
df.head()

Unnamed: 0,address,area,contact_no,coordinates,timing,directions
0,"No D 3, Block 1, Central Market 2, Lajpat Naga...","Lajpat Nagar, South East Delhi",7872478724,"28.568548,77.242454",10:00 AM to 10:00 PM (Closing Soon),https://www.google.com/maps/dir/?api=1&origin=...
1,"Shop No 610 And 611, Ravi Das Marg, Main Bazar...","Laxmi Nagar, New Delhi",0844 8982 285,"28.6365793,77.2790923",10:00 AM to 10:00 PM (Closing Soon),https://www.google.com/maps/dir/?api=1&origin=...
2,"C 7, Jyoti Nagar West, North Chajjupur, Block ...","Shahdara, New Delhi",0844 8982 285,"28.689468,77.2880388",10:00 AM to 10:00 PM (Closing Soon),https://www.google.com/maps/dir/?api=1&origin=...
3,"No E/561/A & E/561B, Palam Extension, Dwarka, ...","Dwarka, New Delhi",0931 3699 995,"28.583891,77.071784",10:00 AM to 10:00 PM (Closing Soon),https://www.google.com/maps/dir/?api=1&origin=...
4,"Virendar Nagar, Block B, Sant Nagar, 100 feet ...","Burari, North Delhi",0931 3699 995,"28.7427944,77.1980152",10:00 AM to 10:00 PM (Closing Soon),https://www.google.com/maps/dir/?api=1&origin=...


In [13]:
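df.to_csv('data.csv', index=False)

# Download the CSV file (google.colab is only available inside Google Colab)
try:
    from google.colab import files
    files.download('data.csv')
except ImportError:
    print("Not running in Colab; data.csv was saved to the working directory.")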
