# Getting data for our NN
We will scrape our data from carwale.com's used car listings, then refine and clean it for training our neural network.

### After analysing the HTML of carwale's listing pages
we can see that they embed the listing data directly in a script tag as JSON, and each page has only around 30 listings.  
So there are two main ways to get more data:  
1. Scraping multiple pages of each city.  
2. Scraping 30 examples from each of around 500 cities.  

I think scraping from different cities is the better idea, for the following reasons:  
1. It seems to be much simpler than paginating.
2. Having data from a long list of cities will give us richer data, i.e., more diversity.

So now let's get a list of about 500 Indian cities (one-word names).
Then I will iterate over each city's page, one by one.
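
Before the full scrape, here is a minimal sketch of the extraction step on its own, run against a made-up HTML snippet rather than a live page. The sample markup, the `extract_listings` helper and the two fake listings are assumptions for illustration; only the `@graph` / `itemListElement` nesting is taken from the scraping cell in this notebook, and the real page may differ.

```python
import json
from bs4 import BeautifulSoup

# Made-up page for illustration; the real carwale markup may differ.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@graph": [{"itemListElement": [
  {"name": "2018 Example Car", "offers": {"price": 450000}},
  {"name": "2015 Another Car", "offers": {"price": 300000}}
]}]}
</script>
</head><body></body></html>
"""

def extract_listings(html):
    """Return the first listing array found in a <head> script tag, or []."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.head is None:
        return []
    for script in soup.head.find_all("script"):
        try:
            data = json.loads(script.string or "")
            return data["@graph"][0]["itemListElement"]
        except (ValueError, KeyError, IndexError, TypeError):
            continue  # not the listings script, try the next one
    return []

listings = extract_listings(SAMPLE_HTML)
print([(car["name"], car["offers"]["price"]) for car in listings])
```

Scripts that are not valid JSON, or that are JSON but lack the expected keys, are simply skipped, so the helper returns an empty list instead of raising.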

In [4]:
from bs4 import BeautifulSoup
import requests, certifi
import json
import pandas

CARDATA = []

indian_cities_500 = ['palghar', 'aligarh', 'siwan', 'jammu', 'junagadh', 'shajapur', 'angul', 'kendrapara', 'vizianagaram', 'ahmedabad', 'darbhanga', 'vidisha', 'bengaluru', 'azamgarh', 'kutch', 'koppal', 'pali', 'lucknow', 'amravati', 'madanapalle', 'latur', 'bahraich', 'bharuch', 'palgharmumbai', 'morbi', 'muzaffarnagar', 'manendragarh', 'patna', 'fatehpur', 'satna', 'churu', 'parbhani', 'bijnor', 'subarnapur', 'farrukhabad', 'bargarh', 'satara', 'banaskantha', 'kannauj', 'dhenkanal', 'devbhoomiDwarka', 'jalna', 'ongole', 'tiruppur', 'khammam', 'tonk', 'kollam', 'nanded', 'karimnagar', 'neemuch', 'basti', 'chhapra', 'ramanagara', 'jalgaon', 'jabalpur', 'telangana', 'shivpuri', 'chittorgarh', 'raisen', 'badaun', 'sagar', 'hapur', 'mysore', 'dakshinakannada', 'madurai', 'visakhapatnam', 'adilabad', 'chandauli', 'udupi', 'ankleshwar', 'kolkata', 'lakhimpur', 'katni', 'dongargarh', 'barmer', 'chennai', 'raigad', 'surat', 'hingoli', 'khurda', 'betul', 'mandsaur', 'bastar', 'narsinghpur', 'tirunelveli', 'aurangabad', 'bulandshahr', 'allahabad', 'wardha', 'mahoba', 'davangere', 'etawah', 'malegaon', 'mandya', 'nandurbar', 'buldhana', 'chandigarh', 'jalore', 'nalgonda', 'gonda', 'amritsar', 'solapur', 'warangal', 'coimbatore', 'barabanki', 'ambikapur', 'jamui', 'ghazipur', 'jhunjhunu', 'yavatmal', 'kanpurdehat', 'rajahmundry', 'chikkaballapur', 'hazaribagh', 'amethi', 'silvassa', 'kawardha', 'kodagu', 'bikaner', 'chamarajanagar', 'korba', 'giridih', 'chittoor', 'chhatarpur', 'gaya', 'karwar', 'kushinagar', 'gadag', 'sirohi', 'mathura', 'shravasti', 'karaikal', 'mysuru', 'bhilai', 'pakur', 'haveri', 'navsari', 'osmanabad', 'purnia', 'jodhpur', 'kasganj', 'daltonganj', 'tikamgarh', 'gorakhpur', 'rampur', 'sitapur', 'dewas', 'cuttack', 'rohtak', 'baripada', 'sambalpur', 'dhar', 'bokaro', 'rourkela', 'bhiwandi', 'koraput', 'mahasamund', 'gadchiroli', 'sangli', 'gurgaon', 'pratapgarh', 'faridabad', 'mumbai', 'nagaur', 'burhanpur', 'ratlam', 'uttarakannada', 'nellimarla', 
'meerut', 'rewa', 'botad', 'saharanpur', 'hoshangabad', 'damoh', 'deoria', 'kochi', 'sehore', 'bhilwara', 'gwalior', 'kurnool', 'kolhapur', 'gandhinagar', 'dehradun', 'porbandar', 'ludhiana', 'rajsamand', 'sonbhadra', 'panipat', 'chikmagalur', 'agar', 'akola', 'asansol', 'machilipatnam', 'belagavi', 'belgaum', 'udaipur', 'simdega', 'dhanbad', 'westgodavari', 'kadapa', 'puri', 'ashoknagar', 'bangalore', 'durgapur', 'jeypore', 'ramgarh', 'tumkur', 'rangareddy', 'bundi', 'shimoga', 'sasaram', 'mahe', 'mau', 'dumka', 'jaisalmer', 'maharajganj', 'champa', 'budaun', 'siddharthnagar', 'bharatpur', 'ajmer', 'vijayawada', 'nellore', 'delhi', 'girSomnath', 'chhindwara', 'baloda', 'salem', 'bhiwani', 'yanam', 'begusarai', 'nabrangpur', 'jhansi', 'puducherry', 'banda', 'munger', 'nashik', 'nuapada', 'auraiya', 'dhule', 'arrah', 'rajmahal', 'ahmednagar', 'shivamogga', 'moradabad', 'panna', 'surajpur', 'eastgodavari', 'jagdalpur', 'katihar', 'pithampur', 'mainpuri', 'nalanda', 'rayagada', 'agra', 'medininagar', 'bhubaneswar', 'ballia', 'kanpur', 'jamnagar', 'balod', 'mandla', 'pune', 'seoni', 'raipur', 'krishna', 'koderma', 'jalaun', 'umaria', 'ranchi', 'raichur', 'karauli', 'varanasi', 'chirmiri', 'alirajpur', 'amreli', 'baghpat', 'nizamabad', 'noida', 'datia', 'khargone', 'balasore', 'mehsana', 'bokarosteelcity', 'sikar', 'prakasam', 'hardoi', 'kolar', 'latehar', 'khandwa', 'raebareli', 'pilibhit', 'srikakulam', 'guwahati', 'sindhudurg', 'sabarkantha', 'dindori', 'mirzapur', 'panchmahal', 'medak', 'jaipur', 'sultanpur', 'balrampur', 'mahabubnagar', 'rajgarh', 'ratnagiri', 'diu', 'unnao', 'chitrakoot', 'thane', 'kota', 'buxar', 'jamalpur', 'jhabua', 'lohardaga', 'hubli', 'guntur', 'jajpur', 'beed', 'chapra', 'deoghar', 'eluru', 'hosur', 'palasa', 'firozabad', 'ambala', 'gulbarga', 'bhavnagar', 'raigarh', 'chitradurga', 'ayodhya', 'dholpur', 'barwani', 'jalandhar', 'nayagarh', 'howrah', 'jaunpur', 'godda', 'nellai', 'gopalpur', 'bareilly', 'bilaspur', 'rajkot', 'maheshtala', 
'gondia', 'valsad', 'surendranagar', 'durg', 'sidhi', 'etah', 'singrauli', 'bhopal', 'ghaziabad', 'kanpurnagar', 'baleshwar', 'vadodara', 'kakinada', 'chandrapur', 'srinagar', 'indore', 'shahjahanpur', 'bijapur', 'alwar', 'anuppur', 'nagpur', 'tirupati', 'davanagere', 'bhagalpur', 'lalitpur', 'washim', 'muzaffarpur', 'ujjain', 'sawaiMadhopur', 'samastipur', 'siliguri', 'hamirpur', 'naviMumbai', 'bhind', 'bhandara', 'bellary', 'chaibasa', 'hassan', 'motihari', 'keonjhar', 'suryapet', 'morena', 'bagalkot', 'anantapur', 'hyderabad', 'guna', 'shamli', 'daman', 'tiruchirappalli', 'bettiah']


browser = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

for city in indian_cities_500:
    try:
        # Fetch the used-car listing page for this city.
        # timeout is an addition so that one stalled request cannot hang the whole loop.
        html = requests.get('https://www.carwale.com/used/' + city + '/', headers=browser, verify=certifi.where(), timeout=30).text

        # Name the parser explicitly so the result does not depend on what is installed
        soup = BeautifulSoup(html, 'html.parser')

        # The listing JSON sits in one of the first three <script> tags of <head>
        for i in range(3):
            try:
                jsonList = soup.head.find_all('script')[i].text
                carDict = json.loads(jsonList)

                listCar = carDict['@graph'][0]['itemListElement']
                for car in listCar:
                    title = car["name"]
                    model = car["model"]
                    vehicleDate = car["vehicleModelDate"]
                    price = car["offers"]["price"]
                    brand = car["Brand"]["name"]
                    kmDriven = car["mileageFromOdometer"]["value"]
                    fuelType = car["fuelType"]
                    bodyType = car["bodyType"]
                    seatCap = car["seatingCapacity"]
                    prevOwns = car["numberOfPreviousOwners"]
                    transmission = car["vehicleTransmission"]

                    CARDATA.append([title, model, vehicleDate, price, brand, kmDriven, fuelType, bodyType, seatCap, prevOwns, transmission])

                print(city + " done!")
                break
            except Exception:
                # Not the listings script (or an unexpected shape): try the next one.
                # "except Exception" still lets Ctrl+C interrupt the scrape.
                if i == 2:
                    print(city + " failed!")

    except Exception:
        # Network error, or a page without a <head>
        print(city + " failed!")


tableCar = pandas.DataFrame(CARDATA, columns=['Name','Model','Date of assembly', 'Price', 'Brand', 'KM Driven', 'Fuel Type', 'Body Type', 'Seat capicity', 'No. of prev owners', 'Transmission'])

print(CARDATA[0], "and", len(CARDATA) - 1, "more")


palghar done!
aligarh done!
siwan done!
jammu done!
junagadh done!
shajapur done!
angul done!
kendrapara done!
vizianagaram done!
ahmedabad done!
darbhanga done!
vidisha done!
bengaluru done!
azamgarh done!
kutch done!
koppal done!
pali done!
lucknow done!
amravati done!
madanapalle done!
latur done!
bahraich done!
bharuch done!
palgharmumbai failed!
morbi done!
muzaffarnagar done!
manendragarh done!
patna done!
fatehpur done!
satna done!
churu done!
parbhani done!
bijnor done!
subarnapur failed!
farrukhabad done!
bargarh done!
satara done!
banaskantha done!
kannauj done!
dhenkanal done!
devbhoomiDwarka failed!
jalna done!
ongole done!
tiruppur done!
khammam done!
tonk done!
kollam done!
nanded done!
karimnagar done!
neemuch done!
basti done!
chhapra done!
ramanagara done!
jalgaon done!
jabalpur done!
telangana failed!
shivpuri done!
chittorgarh done!
raisen done!
badaun done!
sagar done!
hapur done!
mysore done!
dakshinakannada failed!
madurai done!
visakhapatnam done!
adilabad done!
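
In the loop above, a single listing that lacks one of the eleven keys raises inside the inner `try`, so the remaining listings in that script are skipped. One possible alternative, sketched here with the hypothetical helpers `dig` and `car_to_row`, is to record `None` for a missing field and keep going. This assumes partial rows are worth keeping for later cleaning, which may not match what the notebook intends.

```python
# Key paths into one listing dict, in the same order as the DataFrame columns.
FIELD_PATHS = [
    ("name",), ("model",), ("vehicleModelDate",), ("offers", "price"),
    ("Brand", "name"), ("mileageFromOdometer", "value"), ("fuelType",),
    ("bodyType",), ("seatingCapacity",), ("numberOfPreviousOwners",),
    ("vehicleTransmission",),
]

def dig(obj, path):
    """Follow a sequence of keys through nested dicts; None if any step is missing."""
    for key in path:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj

def car_to_row(car):
    """Hypothetical helper: build one row, with None for absent fields."""
    return [dig(car, path) for path in FIELD_PATHS]

# Made-up listing that has no "offers" entry
sample = {"name": "2016 Example Car", "model": "Example", "Brand": {"name": "Acme"}}
row = car_to_row(sample)
print(row)
```

With this helper, `CARDATA.append(car_to_row(car))` would stand in for the eleven assignments, and rows containing `None` could be dropped afterwards with pandas' `dropna`.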


So now the scraping is complete and we have around 8700 examples collected in `tableCar`, but not all of them are ready to be used as training data.  
We first need to remove the duplicate examples.

In [5]:
uniqueTableCar = tableCar.drop_duplicates()
uniqueTableCar.to_csv('UniqueCars.csv', index=False)
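
Called with no arguments, `drop_duplicates` removes only rows that match an earlier row in every column, and it keeps the first occurrence. The toy frame below, with made-up rows and a reduced set of columns, is a small sketch of how to count what gets removed.

```python
import pandas

# Made-up rows: the first and third are identical in every column.
toy = pandas.DataFrame(
    [
        ["2018 Example Car", 450000, "Petrol"],
        ["2015 Another Car", 300000, "Diesel"],
        ["2018 Example Car", 450000, "Petrol"],
    ],
    columns=["Name", "Price", "Fuel Type"],
)

n_duplicates = int(toy.duplicated().sum())  # rows that repeat an earlier row
unique_toy = toy.drop_duplicates()          # keeps the first occurrence

print(n_duplicates, len(toy), len(unique_toy))
```

The same check on the real data would be `tableCar.duplicated().sum()`, which gives the number of rows that `drop_duplicates` discards.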

So now we are left with only 5145 examples, all of them unique real-life listings.  
We will now turn this data into a NumPy array so that it can actually be used as training data.