# Zakupki website scraping for Piotr

The aim of this notebook is to scrape details of each contract hosted on the Russian Zakupki public sector contract awarding website.

The input for this project will be the Zakupki URL. This code can be run on different dates to pull fresh contract data.

Method:
1.   Identify the number of pages of contracts to be scraped (using the contract filters provided).
2.   Iterate through each page, scraping the registration number of each contract.
3.   Access the website for each contract by placing the registraion number in the URL.
4.   Scrape the details for each contract and add them to a list of Contracts dataclasses.
5.   Format these Contract objects as a dataframe and output the dataframe to a csv file.


The output of this project will be the CSV file, with each row representing a new contract from the webstie.


### Section 1: Setup

In [1]:
import requests
from requests.adapters import HTTPAdapter
from urllib.parse import urlparse, parse_qs
from urllib3.util import Retry
from bs4 import BeautifulSoup
from datetime import date
from dataclasses import dataclass
from tqdm import tqdm
from dateutil import parser
from threading import Thread
import pandas as pd
from datetime import datetime, date, timedelta
import logging
import http.client
from functools import lru_cache
from concurrent.futures import ThreadPoolExecutor, as_completed
import math
from os import walk
import json
import numpy as np
import csv


In [2]:
## Finding the memory leak

from collections import Counter
import linecache
import os
import tracemalloc

def display_top(snapshot, key_type='lineno', limit=3):
    snapshot = snapshot.filter_traces((
        tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
        tracemalloc.Filter(False, "<unknown>"),
    ))
    top_stats = snapshot.statistics(key_type)

    print("Top %s lines" % limit)
    for index, stat in enumerate(top_stats[:limit], 1):
        frame = stat.traceback[0]
        # replace "/path/to/module/file.py" with "module/file.py"
        filename = os.sep.join(frame.filename.split(os.sep)[-2:])
        print("#%s: %s:%s: %.1f KiB"
              % (index, filename, frame.lineno, stat.size / 1024))
        line = linecache.getline(frame.filename, frame.lineno).strip()
        if line:
            print('    %s' % line)

    other = top_stats[limit:]
    if other:
        size = sum(stat.size for stat in other)
        print("%s other: %.1f KiB" % (len(other), size / 1024))
    total = sum(stat.size for stat in top_stats)
    print("Total allocated size: %.1f KiB" % (total / 1024))


tracemalloc.start()

In [3]:
logging = False

if logging:

    http.client.HTTPConnection.debuglevel = 1

    # You must initialize logging, otherwise you'll not see debug output.
    logging.basicConfig()
    logging.getLogger().setLevel(logging.DEBUG)
    requests_log = logging.getLogger("requests.packages.urllib3")
    requests_log.setLevel(logging.DEBUG)
    requests_log.propagate = True

### Section 2: Determine Number of pages to scrape
Test connection to the website and determine number of pages to scrape

In [4]:
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

# @lru_cache(maxsize=None)
def getPage(tempURL):
  # If User-Agent is not set to custom, the website will know a Python script is accessing it and block some of the request

  response = session.get(tempURL, headers={'User-Agent': 'Custom'})
  return BeautifulSoup(response.content, "html.parser")

In [5]:
# Getting the dates we want to scrape.

#url="https://zakupki.gov.ru/epz/contractfz223/search/results.html?morphology=on&sortDirection=false&recordsPerPage=_50&showLotsInfoHidden=false&statuses_0=on&statuses_1=on&statuses=0%2C1&priceFrom=1000000&currencyId=-1&contract223DateFrom={}&contract223DateTo={}&sortBy=BY_UPDATE_DATE&pageNumber={}&customerPlace=5277383"
url="https://zakupki.gov.ru/epz/order/extendedsearch/results.html?morphology=on&sortBy=UPDATE_DATE&sortDirection=false&recordsPerPage=_500&showLotsInfoHidden=false&fz223=on&pc=on&priceContractAdvantages44IdNameHidden=%7B%7D&priceContractAdvantages94IdNameHidden=%7B%7D&priceFromGeneral=1000000&priceFromGWS=%D0%9C%D0%B8%D0%BD%D0%B8%D0%BC%D0%B0%D0%BB%D1%8C%D0%BD%D0%B0%D1%8F%D1%86%D0%B5%D0%BD%D0%B0&priceFromUnitGWS=%D0%9C%D0%B8%D0%BD%D0%B8%D0%BC%D0%B0%D0%BB%D1%8C%D0%BD%D0%B0%D1%8F%D1%86%D0%B5%D0%BD%D0%B0&priceToGWS=%D0%9C%D0%B0%D0%BA%D1%81%D0%B8%D0%BC%D0%B0%D0%BB%D1%8C%D0%BD%D0%B0%D1%8F%D1%86%D0%B5%D0%BD%D0%B0&priceToUnitGWS=%D0%9C%D0%B0%D0%BA%D1%81%D0%B8%D0%BC%D0%B0%D0%BB%D1%8C%D0%BD%D0%B0%D1%8F%D1%86%D0%B5%D0%BD%D0%B0&currencyIdGeneral=-1&publishDateFrom={}&publishDateTo={}&pageNumber={}&customerPlace=5277383&selectedSubjectsIdNameHidden=%7B%7D&okdpGroupIdsIdNameHidden=%7B%7D&koksIdsIdNameHidden=%7B%7D&OrderPlacementSmallBusinessSubject=on&OrderPlacementRnpData=on&OrderPlacementExecutionRequirement=on&orderPlacement94_0=0&orderPlacement94_1=0&orderPlacement94_2=0&contractPriceCurrencyId=-1&budgetLevelIdNameHidden=%7B%7D&nonBudgetTypesIdNameHidden=%7B%7D"
startDate = date(2016, 1, 1)
endDate = date(2017, 12, 31)
days = timedelta(days=1)

startDateBefore = startDate

calendar=[]

while startDate<=endDate:
  calendar.append(startDate.strftime('%d.%m.%Y'))
  startDate+=days

print("Created {} dates".format(len(calendar)))
# startDate = date(2016, 1, 1)
# endDate = date(2016, 1, 31)
# days = timedelta(days=1)

# startDateBefore = startDate

# calendar=[]

# while startDate<=endDate:
#   calendar.append(startDate.strftime('%d.%m.%Y'))
#   startDate+=days

# print("Created {} dates".format(len(calendar)))

Created 731 dates



### Section 3: Scrape each registration number

Scrape the reg numbers of each contract, so they can be accessed individually

In [6]:
class Page:

    def __init__(self, day, pageNum, pagefile):

        self.day = day
        self.pageNum = pageNum
        self.pagefile = pagefile


In [7]:
### REMAKE FOR TENDERS, NOT CONTRACT REGISTRY FOR MORE DATA ###

def getContracts(page):
# url="https://zakupki.gov.ru/epz/order/extendedsearch/results.html?morphology=on&search-filter=%D0%94%D0%B0%D1%82%D0%B5+%D1%80%D0%B0%D0%B7%D0%BC%D0%B5%D1%89%D0%B5%D0%BD%D0%B8%D1%8F&pageNumber=1&sortDirection=false&recordsPerPage=_500&showLotsInfoHidden=false&sortBy=UPDATE_DATE&fz223=on&pc=on&priceFromGeneral=1000000&currencyIdGeneral=-1&publishDateFrom=01.01.2016&publishDateTo=08.01.2016&customerPlace=5277383&customerPlaceCodes=66000000000&OrderPlacementSmallBusinessSubject=on&OrderPlacementRnpData=on&OrderPlacementExecutionRequirement=on&orderPlacement94_0=0&orderPlacement94_1=0&orderPlacement94_2=0"
# soup=getPage(url)

 
    # Obtain a list of all the sections of HTML containing a contract in the web page
    listOfContracts = page.find_all("div", {"class": "registry-entry__header-mid__number"})
    regNumbersList=[]

    for contract in listOfContracts:
                regNum = contract.find("a")['href'].split("Id=")[1] ### changed from =id?
                regNumbersList.append(regNum)
                
    return regNumbersList
# print(regNumbersList)

In [8]:
def progress(idx, data):

    x_ = int(((idx+1) * 100) / len(data))
    y_ = idx % math.ceil(len(data) / 10)
    
    print(" ----\n{}% completed\n----".format(x_)) if y_ == 0 else None

In [9]:
# Getting the web page for all the contracts for each date in the range we want to scrape.

regNumbersDict = {}

def getRegNumbersForDate(i, day):

  if day in regNumbersDict:
    return

  tempURL = url.format(day, day, 1)

  # print(tempURL)

  page = getPage(tempURL)

  # Scrape the max number of pages
  try:
    maxPageNum = int(page.select('a[data-pagenumber]')[-2].find("span").text)
    print("{} pages for this day".format(maxPageNum))
  except:
    maxPageNum = 1


  # Leave my variable names alone :(
  totalRegNumbersForThisDay = 0

  for i in range(1, maxPageNum+1):

    # Creating a temporary URL for each page containing contracts
    tempPageURL = url.format(day, day, i)

    # Request the page and format it as a BeautifulSoup object so that we can perform scrapings
    page = getPage(tempPageURL)

    regNumbersList = getContracts(page)

    totalRegNumbersForThisDay += len(regNumbersList)

    regNumbersDict[day] = regNumbersList


  print("Fetched day {} had {} contracts \n".format(day, totalRegNumbersForThisDay), end='')

  del page
  del regNumbersList

  progress(i, calendar)


In [10]:
### This part doesn't work, script is not going through pages, scrapes only first e.g. out of 3, 
### and doesn't save contract regnumbers 
## This now has regNumbers caching too 

cachedRegNums223 = {}

# load the data from the json file
with open('cachedRegNums223.json', 'r') as f:
  cachedRegNums223 = json.load(f)


with ThreadPoolExecutor(max_workers=50) as ex:
  for i, day in enumerate(calendar):
    if day in cachedRegNums223:
      regNumbersDict[day] = cachedRegNums223[day]
      print("Cached day {} had {} contracts \n".format(day, len(regNumbersDict[day])), end='')
    else:
      ex.submit(getRegNumbersForDate, i, day)

    
combinedRegNumbersDict = {**cachedRegNums223, **regNumbersDict}

# print(combinedRegNumbersDict)


with open('cachedRegNums223.json', 'w') as f:
  json.dump(combinedRegNumbersDict, f)


tempRegNumbers = list(regNumbersDict.values())

regNumbers = []

for t in tempRegNumbers:
  regNumbers.extend(t)

# print(regNumbers)

print("------------------- \n {} contracts found in total".format(len(regNumbers)))

https://zakupki.gov.ru/epz/order/extendedsearch/results.html?morphology=on&sortBy=UPDATE_DATE&sortDirection=false&recordsPerPage=_500&showLotsInfoHidden=false&fz223=on&pc=on&priceContractAdvantages44IdNameHidden=%7B%7D&priceContractAdvantages94IdNameHidden=%7B%7D&priceFromGeneral=1000000&priceFromGWS=%D0%9C%D0%B8%D0%BD%D0%B8%D0%BC%D0%B0%D0%BB%D1%8C%D0%BD%D0%B0%D1%8F%D1%86%D0%B5%D0%BD%D0%B0&priceFromUnitGWS=%D0%9C%D0%B8%D0%BD%D0%B8%D0%BC%D0%B0%D0%BB%D1%8C%D0%BD%D0%B0%D1%8F%D1%86%D0%B5%D0%BD%D0%B0&priceToGWS=%D0%9C%D0%B0%D0%BA%D1%81%D0%B8%D0%BC%D0%B0%D0%BB%D1%8C%D0%BD%D0%B0%D1%8F%D1%86%D0%B5%D0%BD%D0%B0&priceToUnitGWS=%D0%9C%D0%B0%D0%BA%D1%81%D0%B8%D0%BC%D0%B0%D0%BB%D1%8C%D0%BD%D0%B0%D1%8F%D1%86%D0%B5%D0%BD%D0%B0&currencyIdGeneral=-1&publishDateFrom=01.01.2016&publishDateTo=01.01.2016&pageNumber=1&customerPlace=5277383&selectedSubjectsIdNameHidden=%7B%7D&okdpGroupIdsIdNameHidden=%7B%7D&koksIdsIdNameHidden=%7B%7D&OrderPlacementSmallBusinessSubject=on&OrderPlacementRnpData=on&OrderPlacemen

### Section 4: Details scraping

The Contract Dataclass will store the information during scraping.
If any information can't be scraped, default values have been provided in their place

In [None]:
@dataclass
class Contract:

  # TODO: Add reg number to class

  # Main Section
  id: float = 0
  price: float = 0.0
  signed: date = None
  # deadline: date = None

  # Tab 1
  method: str = "none"
  procurer: str = "none"
  # supplier: str = "none"
  proinn: str = "none"
  # supinn: str = "none"
  # registered: date = None
  person: str= "none"
  address: str = "none"
  number: str = "none"
  mail: str = "none"

  # Tab 2
  code: float = 0.0
  product: str = "none"
  

  def __repr__(self):
    return "\nContract id= {} \n First tab: price={}, signed={}, method={}, procurer={}, proinn={}, person={}, address={}, number={}, mail={}, product={} \n  Second tab: code={})".format(self.id, self.price, self.signed, self.method, self.procurer, self.proinn, self.person, self.address, self.number, self.mail, self.product, self.code)

  

Method for scraping the data from each contract

In [None]:
def getSectionDict(page):

    sections=page.findAll("div",{"class":"col-9 mr-auto"})

    # print([key.findAll("span") for key in sections])

    # Turning the sections into a dictionary that will be easier to work with.
    pairs = [key.findAll("div") for key in sections]

    pairs = list(filter(None, pairs))


    titles = []
    values = []

    for x in pairs:
        if len(x) > 1:
            try:
                titles.append(x[0])
                values.append(x[1])
            except:
                pass

    sectionDict = {titles[i].text.strip() : values[i].text.strip() for i in range(len(titles))}

    return sectionDict


# def getTableDict(page, secondTab=False):


#     if secondTab:
#         sectionOfInterest = page.findAll("div", {"class": "col"})[-1]
#     else:
#         sectionOfInterest = page

#     table = sectionOfInterest.findAll("tr",{"class":"tableBlock__row"})

#     # print(table[3])

#     headers = [i.text.strip() for i in table[0].findAll("th", {"class":"tableBlock__col tableBlock__col_header"})]
#     data = [list(filter(None, [j.strip() for j in i.text.split("\n")])) for i in table[1].findAll("td")]

#     if len(headers) == 0:
#         headers = [i.text.strip() for i in table[2].findAll("th", {"class":"tableBlock__col tableBlock__col_header"})]
#         data = [list(filter(None, [j.strip() for j in i.text.split("\n")])) for i in table[3].findAll("td")]

#     if len(headers) == 0:
#         headers = [i.text.strip() for i in table[3].findAll("th", {"class":"tableBlock__col tableBlock__col_header"})]
#         data = [list(filter(None, [j.strip() for j in i.text.split("\n")])) for i in table[4].findAll("td")]



#     # This is hacky.
#     if len(data) < len(headers):
#         data = [[[]] for i in range(len(headers))]

        

#     tableDict = {headers[i] : data[i] for i in range(len(headers))}

#     return tableDict

In [None]:
def scrapeData(reg):

  try:
    # Input: reg = one registration number.

    # Different URL from the one above, this accesses more information from Zakupki.
    dir="https://zakupki.gov.ru/epz/order/notice/notice223/{}.html?noticeInfoId={}"
    #"https://zakupki.gov.ru/epz/contract/contractCard/{}.html?reestrNumber={}"

    # Getting the web page for the given contract
    tempDir = dir.format("common-info", reg)
    page = getPage(tempDir)

    # We probably don't need this with the method I've used below.
    # contractTypeTwo = False

    # Enter the text here that should be present to signify the second type of contract.
    # if page.findAll(text="Основание заключения контракта с единственным поставщиком"):
    #   contractTypeTwo = True
      
    id = reg  
    sectionDict = getSectionDict(page)
    # firstTableDict = getTableDict(page)

    # print(sectionDict)
    # print(firstTableDict)

    # print(sectionDict, firstTableDict)
    # try:
    #   price=sectionDict["Цена контракта"].replace("\xa0","").replace(",",".").replace("₽","").strip().split()[0]
    # except:
    #   price=sectionDict["Ориентировочное значение цены контракта"].replace("\xa0","").replace(",",".").replace("₽","").strip().split()[0]
    #   try:
    #     price=sectionDict["Максимальное значение цены контракта"].replace("\xa0","").replace(",",".").replace("₽","").strip().split()[0]
    #   except:
    try:
      price=page.find('div', {'class':'price-block__value'}).text.strip().replace("₽","").replace(",",".").replace(" ", "")
    except: 
      price=""
    try:
      signed=page.find('div',{'class':'data-block__value'}).text.strip()
    except:
      signed=""
    # deadline=sectionDict["Дата окончания исполнения контракта"].split()[0]
    
    ### fixed issue with method ### 
    try:
      method = sectionDict["Способ осуществления закупки"]
    except:
      if page.findAll(text="Основание заключения контракта с единственным поставщиком"):
            method="Закупка у единственного поставщика (подрядчика, исполнителя)"
            
    procurer=sectionDict["Наименование организации"]
    # supplier=firstTableDict["Организация"][0]

    proinn=page.find('div', {'class':'ml-1 common-text__value'}).text.strip()

    ### fixed issues for missing values sometimes in the table ###
    
    # registered=firstTableDict["Организация"][-1]
    
    ### fixed, testing ###   
    # try: 
    #   if firstTableDict["Организация"][-4]=="КПП:":
    #       supinn=firstTableDict["Организация"][-5]
    #   else:
    #       supinn=firstTableDict["Организация"][-3]
    # except:
    #   supinn=""

    ### fixed issues in lower table ### 
    try:
      address=sectionDict['Место нахождения']
    except:
      address=""
    try:
      mail=sectionDict['Адрес электронной почты']
    except:
      mail=""
    number=sectionDict['Контактный телефон']
    try:
      product=sectionDict['Наименование закупки']
    except:
      product=""
    try:
      person=sectionDict['Контактное лицо']
    except:
      person=""


    ### details about winner - ALSO, THERE'S OPTION TO SCRAPE SUBCONTRACTORS ### 

    page.decompose()

    ### Second tab ###
    tempDir = dir.format("lot-list", reg)
    page = getPage(tempDir)

    ### code stands for the product code, which can be later identified to return industry type ### 
    
    # secondTableDict = getTableDict(page, True)
    
    codetablevalues=page.findAll({"td":"class"})
    
    try:
      code = code=codetablevalues[3].text.strip().split(' ')[0]
    except:
      code = ''
        
    # product = secondTableDict["Наименование объекта закупки и его характеристики"][0]
    
    # Create the Contract dataclass object and append it to a list of objects.
    # This method means that missing data can be accounted for.
    # print(method)

    contract = Contract(id=id, price=price, signed=signed, method=method, procurer=procurer, proinn=proinn, address=address, person=person, number=number, mail=mail, code=code, product=product)
    
    # contracts.append(contract)
    # print('Completed {}'.format(id))

    page.decompose()

    return contract
  except Exception as e:
    failedRegNumbers.append(reg)
    print("Failed to scrape {}".format(reg))
    print(e)

### Section 5: Starting execution
Scrape the contracts themselves using threading

In [None]:
def scrape(reg):
    
    try:
        _ = int(reg)
        # print("Scraping {}".format(reg))
        return [scrapeData(reg)]
    except TypeError:
        
        # TODO make 500 contracts change here.
        
        contracts = []

        

        for idx, r in enumerate(reg):
            # print("Scraping {}".format(r))
            contracts.append(scrapeData(r))
            # progress(i + idx, regNumbers)

        return contracts

In [None]:
progressNum = 0

failedRegNumbers = []

# regNumbers = [14511297]

print("Starting scrape with {} reg numbers\n".format(len(regNumbers)))

# scrape(regNumbers[:10])

# for contract in contracts:
#     print(contract)

# for regNumber in tqdm(regNumbers[:50]):
#   thread = Thread(target = scrapeData, args = (regNumber,))
#   thread.start()

# regNumbers = ['3662502457421000001']
# regNumbers = ['1665800691921000016']

threading = True

if threading:

    interval = 1

    with ThreadPoolExecutor(max_workers=20) as ex:
        threads = []

        cachedContracts = {}

        with open('cachedContracts.csv', encoding="utf-8") as f:
            cachedContracts = list(csv.reader(f))
        
        cachedContracts = list(filter(None, cachedContracts))
        
        cachedContractRegNums = [row[0] for row in cachedContracts]

        uncachedRegNumbers = list(set(regNumbers) - set(cachedContractRegNums))

        print("{} of {} contracts are uncached. Fetching...".format(len(uncachedRegNumbers), len(regNumbers)))
        
        for i in range(0, len(uncachedRegNumbers), interval):
            tempNumbers = uncachedRegNumbers[i:i+interval]
            # print(tempNumbers)
            threads.append(ex.submit(scrape, tempNumbers))
        

        completed = 0

        for result in tqdm(as_completed(threads)):

            try:
                contracts = result.result()

                completed += interval

                # progress(completed, uncachedRegNumbers)

                formattedContracts = [list(contract.__dict__.values()) for contract in contracts]

                # newCachedContracts = {**cachedContracts, **formattedContracts}

                with open('cachedContracts.csv', 'a', encoding="utf-8", newline='') as f:
                    writer = csv.writer(f)
                    writer.writerows(formattedContracts)
            except Exception as e:
                print(e)
                
            # print("{} finished".format(interval))

else:

    scrape(regNumbers)

print("Scraped {} contracts".format(len(contracts)))
print("Failed to scrape {} contracts".format(len(failedRegNumbers)))
print(failedRegNumbers) if len(failedRegNumbers) > 0 else None

snapshot = tracemalloc.take_snapshot()
display_top(snapshot)

Starting scrape with 758 reg numbers

758 of 758 contracts are uncached. Fetching...


15it [00:16,  1.96it/s]

Failed to scrape 3855433
'NoneType' object has no attribute 'text'


18it [00:20,  1.22it/s]

'NoneType' object has no attribute '__dict__'


27it [00:32,  1.14s/it]

Failed to scrape 3873466
'NoneType' object has no attribute 'text'


34it [00:44,  1.74s/it]

'NoneType' object has no attribute '__dict__'

35it [00:45,  1.41s/it]




74it [01:43,  2.23s/it]

Failed to scrape 3823135
'NoneType' object has no attribute 'text'


76it [01:46,  1.71s/it]

Failed to scrape 3890634
'NoneType' object has no attribute 'text'


96it [02:12,  1.46s/it]

Failed to scrape 3944937
'NoneType' object has no attribute 'text'


116it [02:35,  1.11it/s]

'NoneType' object has no attribute '__dict__'


120it [02:40,  1.23it/s]

'NoneType' object has no attribute '__dict__'


126it [02:48,  1.58s/it]

Failed to scrape 3819660
'NoneType' object has no attribute 'text'


133it [02:54,  1.11it/s]

Failed to scrape 3893781
'NoneType' object has no attribute 'text'


147it [03:10,  1.28it/s]

'NoneType' object has no attribute '__dict__'


150it [03:15,  1.43s/it]

Failed to scrape 3894522
'NoneType' object has no attribute 'text'


192it [04:17,  1.34s/it]

'NoneType' object has no attribute '__dict__'


193it [04:18,  1.23s/it]

Failed to scrape 3842582
'NoneType' object has no attribute 'text'


197it [04:21,  1.08it/s]

Failed to scrape 3818868
'NoneType' object has no attribute 'text'


199it [04:22,  1.41it/s]

'NoneType' object has no attribute '__dict__'


216it [04:53,  1.87s/it]

Failed to scrape 3810690
'NoneType' object has no attribute 'text'


223it [05:01,  1.02it/s]

'NoneType' object has no attribute '__dict__'

224it [05:01,  1.17it/s]




225it [05:03,  1.17s/it]

Failed to scrape 3830044
'NoneType' object has no attribute 'text'


230it [05:11,  1.42s/it]

Failed to scrape 3823420
'NoneType' object has no attribute 'text'


269it [06:17,  1.72s/it]

Failed to scrape 3888533
'NoneType' object has no attribute 'text'


302it [07:01,  1.05s/it]

'NoneType' object has no attribute '__dict__'


305it [07:06,  1.29s/it]

'NoneType' object has no attribute '__dict__'


336it [07:37,  1.15s/it]

'NoneType' object has no attribute '__dict__'


346it [07:49,  1.34s/it]

Failed to scrape 3900798
'NoneType' object has no attribute 'text'


349it [07:54,  1.26s/it]

'NoneType' object has no attribute '__dict__'


362it [08:02,  1.22it/s]

'NoneType' object has no attribute '__dict__'


390it [08:41,  2.21s/it]

Failed to scrape 3845947
'NoneType' object has no attribute 'text'


392it [08:44,  1.85s/it]

Failed to scrape 3886931
'NoneType' object has no attribute 'text'


405it [09:05,  1.80s/it]

Failed to scrape 3894003
'NoneType' object has no attribute 'text'
Failed to scrape 3890522
'NoneType' object has no attribute 'text'


415it [09:23,  2.12s/it]

Failed to scrape 3893428
'NoneType' object has no attribute 'text'


425it [09:37,  1.53s/it]

Failed to scrape 3815708
'NoneType' object has no attribute 'text'


432it [09:45,  1.07it/s]

'NoneType' object has no attribute '__dict__'


437it [09:51,  1.29s/it]

Failed to scrape 3845671

438it [09:53,  1.57s/it]


'NoneType' object has no attribute 'text'


457it [10:21,  1.67s/it]

Failed to scrape 3904774
'NoneType' object has no attribute 'text'


467it [10:34,  1.32s/it]

Failed to scrape 3890829
'NoneType' object has no attribute 'text'


469it [10:38,  1.45s/it]

Failed to scrape 3925299
'NoneType' object has no attribute 'text'


670it [11:03, 121.99it/s]

'NoneType' object has no attribute '__dict__'
'NoneType' object has no attribute '__dict__'
'NoneType' object has no attribute '__dict__'
'NoneType' object has no attribute '__dict__'
'NoneType' object has no attribute '__dict__'
'NoneType' object has no attribute '__dict__'
'NoneType' object has no attribute '__dict__'
'NoneType' object has no attribute '__dict__'
'NoneType' object has no attribute '__dict__'
'NoneType' object has no attribute '__dict__'


758it [11:04,  1.14it/s] 


'NoneType' object has no attribute '__dict__'
Failed to scrape 25 contracts
['3855433', '3873466', '3823135', '3890634', '3944937', '3819660', '3893781', '3894522', '3842582', '3818868', '3810690', '3830044', '3823420', '3888533', '3900798', '3845947', '3886931', '3894003', '3890522', '3893428', '3815708', '3845671', '3904774', '3890829', '3925299']
Top 3 lines
#1: bs4\element.py:1228: 3093.9 KiB
    self.parser_class = parser.__class__
#2: bs4\__init__.py:721: 2746.9 KiB
    tag = self.element_classes.get(Tag, Tag)(
#3: bs4\element.py:943: 2532.2 KiB
    u = str.__new__(cls, value)
2555 other: 15847.9 KiB
Total allocated size: 24220.9 KiB


### Now that the data is saved to the hard disk, we can run the below code without needing to rerun the scraping process

In [None]:
cachedContracts = []

with open('cachedContracts.csv', encoding="utf-8") as f:
    cachedContracts = list(csv.reader(f))


df = pd.DataFrame(columns=Contract().__dict__.keys(), data=cachedContracts)

df['signed'] = pd.to_datetime(df['signed'])
# df['deadline'] = pd.to_datetime(df['deadline'])

start = datetime.fromordinal(startDateBefore.toordinal()).strftime("%Y-%m-%d")
end = datetime.fromordinal(endDate.toordinal()).strftime("%Y-%m-%d")

print(start, end)

mask = (df['signed'] >= start) & (df['signed'] <= end)

selectedDatesDF = df.loc[mask]

selectedDatesDF.head()

2016-01-01 2016-01-31


  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listlike)
  cache_array = _maybe_cache(arg, format, cache, convert_listl

Unnamed: 0,id,price,signed,method,procurer,proinn,person,address,number,mail,code,product
5,3885158,1 516 160.00,2016-01-27,Закупка у единственного поставщика (исполнител...,"ОТКРЫТОЕ АКЦИОНЕРНОЕ ОБЩЕСТВО ""БОГДАНОВИЧСКАЯ ...",6633016739,Старостина Н.Б.,"623530, ОБЛАСТЬ СВЕРДЛОВСКАЯ,РАЙОН БОГДАНОВИЧС...",+7 (34376) 51744,natalyastaros@mail.ru,35.12.10.110 Услуги по передаче электроэнергии,Приобретение электроэнергии для производственн...
6,3859061,2 200 000.00,2016-01-21,Закупка у единственного поставщика (исполнител...,"МУНИЦИПАЛЬНОЕ АВТОНОМНОЕ УЧРЕЖДЕНИЕ ""ДВОРЕЦ КУ...",6606027650,,"624091, Свердловская обл, г Верхняя Пышма, пр-...",+7 (34368) 47555,dkuem@elem.ru,81.21.10.000 Услуги по общей уборке зданий,Оказание услуг по уборке помещений
7,3886413,1 794 743.67,2016-01-27,Закупка у единственного поставщика (исполнител...,"АКЦИОНЕРНОЕ ОБЩЕСТВО ""5 ЦЕНТРАЛЬНЫЙ АВТОМОБИЛЬ...",6659192672,,"620050, Свердловская обл, г Екатеринбург, ул Б...",+7 (343) 3229810,tender@5carz.ru,"35 Электроэнергия, газ, пар и кондиционировани...",Поставка газа и услуги по транспортировке
8,3903221,3 850 000.00,2016-01-29,Запрос предложений,"ОБЩЕСТВО С ОГРАНИЧЕННОЙ ОТВЕТСТВЕННОСТЬЮ ""УРАЛ...",6623091790,Белорусцев С.В.,"622051, ОБЛАСТЬ СВЕРДЛОВСКАЯ, Г. НИЖНИЙ ТАГИЛ,...",+7 (3435) 377420,belorustsev@ubtuvz.ru,71.20.1 Услуги в области технических испытаний...,Проведение комплекса ходовых испытаний опытног...
9,3862550,1 182 887.24,2016-01-22,Закупка у единственного поставщика (исполнител...,"АКЦИОНЕРНОЕ ОБЩЕСТВО ""ВОДОКАНАЛ""",6603017615,Кривоногов А.Г.,"624264, ОБЛАСТЬ СВЕРДЛОВСКАЯ,ГОРОД АСБЕСТ,УЛИЦ...",+7 (34365) 78125,watchan@yandex.ru,64.91.10.190 Услуги по финансовой аренде (лизи...,Финансовая аренда (лизинг) легкового автомобил...


In [None]:
selectedDatesDF.shape

(635, 12)

### Section 6: Output

Convert the list of contract classes to a dataframe so that they can be exported to a csv file

In [None]:
selectedDatesDF.to_csv("zakupki{}to{}.csv".format(startDateBefore, endDate))