## Python script for scraping vehicle complaint data from [www.carcomplaints.com](http://www.carcomplaints.com)

In [200]:
from collections import OrderedDict
from bs4 import BeautifulSoup
import urllib.request as request
import re

url_Honda = 'http://www.carcomplaints.com/Honda/'
html_Honda = request.urlopen(url_Honda)

soup_Honda = BeautifulSoup(html_Honda, 'html.parser')  # Name the parser explicitly so results do not depend on what is installed

### Honda Models Overall Complaint Counts (http://www.carcomplaints.com/Honda/)

In [201]:
# Column lists have ids such as "c1"; anchor the pattern so only those ids match
ul = soup_Honda.find_all('ul', class_='column bar', id=re.compile(r'^c\d+'))
ul

[<ul class="column bar" id="c1">
<li><a href="/Honda/Accord/" title="Honda Accord complaints (8,560)">Accord</a> <span class="count">8,560</span> <span class="index" style="width: 100%;"> </span></li>
<li><a href="/Honda/Accord_Crosstour/" title="Honda Accord Crosstour complaints (6)">Accord Crosstour</a> <span class="count">6</span> <span class="index" style="width: 5%;"> </span></li>
<li><a href="/Honda/Accord_Hybrid/" title="Honda Accord Hybrid complaints (12)">Accord Hybrid</a> <span class="count">12</span> <span class="index" style="width: 5%;"> </span></li>
<li><a href="/Honda/Ballade/" title="Honda Ballade complaints (1)">Ballade</a> <span class="count">1</span> <span class="index" style="width: 5%;"> </span></li>
<li><a href="/Honda/Brio/" title="Honda Brio complaints (1)">Brio</a> <span class="count">1</span> <span class="index" style="width: 5%;"> </span></li>
<li><a href="/Honda/City/" title="Honda City complaints (22)">City</a> <span class="count">22</span> <span class="ind

As the output above shows, the data I want (the model name and the number of complaints) is inside the `<li>` tags. I will build a Python dict from this data:

In [202]:
honda_model_counts_dict = {}
# The page splits the model list across an arbitrary number of <ul> columns
for column in ul:
    for row in column.find_all('li'):
        # The first <span> in each row holds the count, e.g. "8,560"
        honda_model_counts_dict[row.a.get_text()] = int(row.span.get_text().replace(",", ""))
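
According to the Beautiful Soup documentation, a compiled pattern passed as `id=` is applied with `search()`, so the choice of pattern decides which lists are kept. A minimal sketch of the difference between an unanchored `c*` and an anchored pattern, using only the standard `re` module. The id `c1` comes from the page source above; the other ids are made up for illustration.

```python
import re

# "c1" appears in the page source above; the other ids are hypothetical
sample_ids = ["c1", "c2", "nav", "footer"]

loose = re.compile("c*")       # zero or more "c": also matches the empty string
strict = re.compile(r"^c\d+")  # "c" followed by digits at the start of the id

loose_hits = [i for i in sample_ids if loose.search(i)]
strict_hits = [i for i in sample_ids if strict.search(i)]

print(loose_hits)   # every id, because an empty match always succeeds
print(strict_hits)  # only the ids shaped like "c1"
```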

In [203]:
honda_model_counts_dict

{'Accord': 8560,
 'Accord Crosstour': 6,
 'Accord Hybrid': 12,
 'Ballade': 1,
 'Brio': 1,
 'CR-V': 750,
 'CR-Z': 11,
 'CRX': 4,
 'City': 22,
 'Civic': 3775,
 'Civic Hybrid': 174,
 'Crosstour': 9,
 'Del Sol': 2,
 'Element': 112,
 'Fit': 133,
 'Fit EV': 0,
 'Insight': 12,
 'Jazz': 12,
 'Odyssey': 1555,
 'Orthia': 1,
 'Passport': 66,
 'Pilot': 522,
 'Prelude': 54,
 'Ridgeline': 81,
 'S2000': 4}

### Acura Models Overall Complaint Counts

Using the same procedure as for Honda, I fetch the Acura models and their complaint counts, then sort them by count and render the result as an HTML table.

In [204]:
from collections import OrderedDict
from bs4 import BeautifulSoup
from IPython.display import HTML
import urllib.request as request
import re

url_Acura = 'http://www.carcomplaints.com/Acura/'
html_Acura = request.urlopen(url_Acura)

soup_Acura = BeautifulSoup(html_Acura, 'html.parser')
# Column lists have ids such as "c1"; anchor the pattern so only those ids match
ul = soup_Acura.find_all('ul', class_='column bar', id=re.compile(r'^c\d+'))

acura_model_counts_dict = {}
# The page splits the model list across an arbitrary number of <ul> columns
for column in ul:
    for row in column.find_all('li'):
        acura_model_counts_dict[row.a.get_text()] = int(row.span.get_text().replace(",", ""))

OD_Acura = OrderedDict(sorted(acura_model_counts_dict.items(), key=lambda t: t[1], reverse=True))

s_header = '<table border="1"><tr><th>Model Name</th><th># of Complaints</th></tr>'

s_data = ''
for key in OD_Acura.keys():
    s_data = s_data + '<tr><td align="center">' + key + '</td>' + '<td align="center">' + str(OD_Acura[key]) + '</td></tr>'

s_footer = "</table>"

h = HTML(s_header+s_data+s_footer);h

| Model Name | # of Complaints |
|---|---|
| TL | 109 |
| MDX | 50 |
| TSX | 36 |
| Legend | 27 |
| Integra | 25 |
| RDX | 20 |
| CL | 17 |
| RSX | 15 |
| RL | 11 |
| 1.7EL | 5 |
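
The sort-and-render step does not depend on the make, so it could be pulled into a small helper. A sketch, assuming the same table markup as the cell above; `renderCountsTable` is a hypothetical name that is not part of this notebook, and escaping the model names with the standard `html.escape` is an added precaution. The sample counts are the top three Acura rows.

```python
from html import escape

def renderCountsTable(counts):
    """Return an HTML table of (model, count) rows, highest count first."""
    rows = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    body = "".join(
        '<tr><td align="center">{}</td><td align="center">{}</td></tr>'.format(
            escape(model), count
        )
        for model, count in rows
    )
    header = '<table border="1"><tr><th>Model Name</th><th># of Complaints</th></tr>'
    return header + body + "</table>"

# Three Acura rows from the output above, deliberately out of order
table_html = renderCountsTable({"TSX": 36, "TL": 109, "MDX": 50})
```

In the notebook, `HTML(table_html)` from `IPython.display` would display the result.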


#### Honda version executed all in one cell

In [205]:
from collections import OrderedDict
from bs4 import BeautifulSoup
from IPython.display import HTML
import urllib.request as request
import re

url_Honda = 'http://www.carcomplaints.com/Honda/'
html_Honda = request.urlopen(url_Honda)

soup_Honda = BeautifulSoup(html_Honda, 'html.parser')
# Column lists have ids such as "c1"; anchor the pattern so only those ids match
ul = soup_Honda.find_all('ul', class_='column bar', id=re.compile(r'^c\d+'))

honda_model_counts_dict = {}
# The page splits the model list across an arbitrary number of <ul> columns
for column in ul:
    for row in column.find_all('li'):
        honda_model_counts_dict[row.a.get_text()] = int(row.span.get_text().replace(",", ""))

OD_Honda = OrderedDict(sorted(honda_model_counts_dict.items(), key=lambda t: t[1], reverse=True))

s_header = '<table border="1"><tr><th>Model Name</th><th># of Complaints</th></tr>'

s_data = ''
for key in OD_Honda.keys():
    s_data = s_data + '<tr><td align="center">' + key + '</td>' + '<td align="center">' + str(OD_Honda[key]) + '</td></tr>'

s_footer = "</table>"

h = HTML(s_header+s_data+s_footer);h

| Model Name | # of Complaints |
|---|---|
| Accord | 8560 |
| Civic | 3775 |
| Odyssey | 1555 |
| CR-V | 750 |
| Pilot | 522 |
| Civic Hybrid | 174 |
| Fit | 133 |
| Element | 112 |
| Ridgeline | 81 |
| Passport | 66 |


#### Now that I've shown step by step how to obtain the number of complaints for each model, it is time to wrap the steps in a function so the code can be reused

In [253]:
from collections import OrderedDict
from bs4 import BeautifulSoup
import urllib.request as request
import re

def getCountsByModel(make):
    """Return the number of complaints for each model of a vehicle make.

    Applicable make values include 'Honda', 'Acura', 'Ford', 'GMC', etc.
    Returns an OrderedDict where the key is the model and the value is the
    qty of complaints, in the order the models appear on the page."""

    url = 'http://www.carcomplaints.com/'
    url_make = url + make + '/'
    html_make = request.urlopen(url_make)

    soup = BeautifulSoup(html_make, 'html.parser')
    # Column lists have ids such as "c1"; anchor the pattern so only those ids match
    ul = soup.find_all('ul', class_='column bar', id=re.compile(r'^c\d+'))

    make_model_counts_dict = OrderedDict()
    # The page splits the model list across an arbitrary number of <ul> columns
    for column in ul:
        for row in column.find_all('li'):
            make_model_counts_dict[row.a.get_text()] = int(row.span.get_text().replace(",", ""))

    return make_model_counts_dict

In [254]:
getCountsByModel('Honda')

OrderedDict([('Accord', 8560), ('Accord Crosstour', 6), ('Accord Hybrid', 12), ('Ballade', 1), ('Brio', 1), ('City', 22), ('Civic', 3775), ('Civic Hybrid', 174), ('CR-V', 750), ('CR-Z', 11), ('Crosstour', 9), ('CRX', 4), ('Del Sol', 2), ('Element', 112), ('Fit', 133), ('Fit EV', 0), ('Insight', 12), ('Jazz', 12), ('Odyssey', 1555), ('Orthia', 1), ('Passport', 66), ('Pilot', 522), ('Prelude', 54), ('Ridgeline', 81), ('S2000', 4)])
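
`getCountsByModel` builds its URL by plain string concatenation, which works for 'Honda' but is less obvious for names that contain spaces. The links in the page source shown earlier use underscores (`/Honda/Accord_Crosstour/`), so a URL helper could be sketched as follows. `buildUrl` is a hypothetical name, and whether every make and model on the site follows the underscore convention is an assumption.

```python
from urllib.parse import quote

BASE_URL = 'http://www.carcomplaints.com/'

def buildUrl(*parts):
    """Join path parts into a site URL, replacing spaces with underscores."""
    # Assumption: the site writes spaces as underscores, as in /Honda/Accord_Crosstour/
    segments = [quote(str(part).strip().replace(' ', '_')) for part in parts]
    return BASE_URL + '/'.join(segments) + '/'

make_url = buildUrl('Honda')
year_url = buildUrl('Honda', 'Accord Crosstour', 2012)
print(make_url)
print(year_url)
```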

#### I also made a function to get all available makes at carcomplaints.com

In [208]:
from bs4 import BeautifulSoup
import urllib.request as request
import re

def getMakes():
    """Function to get all the makes available at carcomplaints.com"""
    
    url = 'http://www.carcomplaints.com/'
    html = request.urlopen(url)
    
    soup = BeautifulSoup(html, 'html.parser')
    sections = soup.find_all('section', id=re.compile('makes'))

    make_list = []
    for section in sections:
        for li in section.find_all('li'):
            make_list.append(li.get_text())
    
    return make_list

In [209]:
getMakes()

['Acura',
 'Audi',
 'BMW',
 'Buick',
 'Cadillac',
 'Chevrolet',
 'Chrysler',
 'Dodge',
 'Ford',
 'GMC',
 'Honda',
 'Hyundai',
 'Infiniti',
 'Isuzu',
 'Jeep',
 'Kia',
 'Lexus',
 'Lincoln',
 'Mazda',
 'Mercedes-Benz',
 'Mercury',
 'Mini',
 'Mitsubishi',
 'Nissan',
 'Oldsmobile',
 'Plymouth',
 'Pontiac',
 'Porsche',
 'Ram',
 'Saab',
 'Saturn',
 'Scion',
 'Subaru',
 'Toyota',
 'Volvo',
 'VW',
 'Alfa Romeo',
 'AMC',
 'Bentley',
 'Chery',
 'Daewoo',
 'Datsun',
 'Eagle',
 'Ferrari',
 'Fiat',
 'Geo',
 'Holden',
 'HSV',
 'Hummer',
 'Jaguar',
 'Kenworth',
 'Lamborghini',
 'Land Rover',
 'Mahindra',
 'Maruti',
 'Opel',
 'Peugeot',
 'Renault',
 'Rover',
 'Seat',
 'Skoda',
 'Suzuki',
 'Tata',
 'Tesla',
 'Vauxhall',
 'Zimmer']
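
With `getMakes` and `getCountsByModel` in hand, the counts for several makes can be gathered in one loop. A sketch, assuming one request per make and a short pause between requests to go easy on the site. `collectCounts`, its `fetch` parameter, and the delay value are illustrative additions that are not part of this notebook. The fetch function is passed in so the loop can be tried offline with a stand-in.

```python
import time

def collectCounts(makes, fetch, delay_seconds=1.0):
    """Return {make: fetch(make)} for each make, pausing between calls."""
    results = {}
    for position, make in enumerate(makes):
        if position > 0:
            time.sleep(delay_seconds)  # Pause between requests, not before the first
        results[make] = fetch(make)
    return results

# Offline stand-in for getCountsByModel, using counts shown earlier in this notebook
def fakeFetch(make):
    sample = {'Honda': {'Accord': 8560, 'Civic': 3775},
              'Acura': {'TL': 109, 'MDX': 50}}
    return sample[make]

all_counts = collectCounts(['Honda', 'Acura'], fakeFetch, delay_seconds=0)
print(all_counts['Honda']['Accord'])
```

With network access, the real call would be along the lines of `collectCounts(getMakes(), getCountsByModel)`, although make names that contain spaces may need adjusting before they work as URLs.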

#### Function to get the available model years and complaint qty for a given model

In [221]:
from bs4 import BeautifulSoup
import urllib.request as request
import re

def getYearCounts(make, model):
    """Function that returns a Python dict that contains model years and their complaint qty"""
    
    url = 'http://www.carcomplaints.com/'+make+'/'+model+'/'
    html = request.urlopen(url)

    soup = BeautifulSoup(html, 'html.parser')
    # Keep the <li> rows whose id contains "bar"
    li = soup.find_all('li', id=re.compile('bar'))

    year_counts_dict = {}
    for item in li:
        year = int(item.find('span', class_='label').get_text())
        count = int(item.find('span', class_='count').get_text().replace(",", ""))
        year_counts_dict[year] = count
    
    return year_counts_dict

In [222]:
getYearCounts('Honda','Accord')

{1979: 1,
 1986: 15,
 1987: 2,
 1988: 13,
 1989: 16,
 1990: 33,
 1991: 69,
 1992: 39,
 1993: 37,
 1994: 44,
 1995: 20,
 1996: 38,
 1997: 52,
 1998: 392,
 1999: 313,
 2000: 424,
 2001: 488,
 2002: 836,
 2003: 1447,
 2004: 460,
 2005: 184,
 2006: 141,
 2007: 212,
 2008: 2031,
 2009: 696,
 2010: 243,
 2011: 123,
 2012: 112,
 2013: 78,
 2014: 1}
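
Once the counts are in a dict keyed by year, ordinary dict operations answer simple questions about them. A small sketch using a few of the Accord values shown above; the variable names are illustrative.

```python
# A subset of the getYearCounts('Honda', 'Accord') output above
accord_counts = {2002: 836, 2003: 1447, 2004: 460, 2007: 212, 2008: 2031, 2009: 696}

# Model year with the most complaints
worst_year = max(accord_counts, key=accord_counts.get)

# Years ranked from most to fewest complaints
ranked_years = sorted(accord_counts, key=accord_counts.get, reverse=True)

print(worst_year)        # 2008
print(ranked_years[:3])  # [2008, 2003, 2002]
```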

#### Function to get Top Systems by Qty

In [198]:
from bs4 import BeautifulSoup
from collections import OrderedDict
import urllib.request as request
import re

def getTopSystemsQty(make, model, year):
    """Function that returns an OrderedDict containing system problems and their complaint qty"""
    
    url = 'http://www.carcomplaints.com/'+make+'/'+model+'/'+str(year)+'/'
    html = request.urlopen(url)

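    soup = BeautifulSoup(html, 'html.parser')
    # Keep the <li> rows whose id contains "bar"
    li = soup.find_all('li', id=re.compile('bar'))

    problem_counts_dict = OrderedDict()  # We want to maintain insertion order
    for item in li:
        try:
            # The href ends with a forward slash, which is removed for the key
            problem_counts_dict[item.a['href'][:-1]] = int(item.span.get_text().replace(",", ""))
        except (AttributeError, TypeError, KeyError, ValueError):
            # Skip rows that lack a link, a count span, or a numeric count
            continue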
        
    return problem_counts_dict

In [199]:
getTopSystemsQty(year=2012, make='Honda', model='CR-V')

OrderedDict([('electrical', 21), ('accessories-interior', 11), ('engine', 7), ('body_paint', 5), ('transmission', 2), ('wheels_hubs', 2), ('cooling_system', 1), ('steering', 1), ('suspension', 1)])

In [280]:
getTopSystemsQty(year=2001, make='Nissan', model='Altima')

OrderedDict([('engine', 144), ('electrical', 6), ('windows_windshield', 6), ('cooling_system', 3), ('transmission', 3), ('AC_heater', 2), ('body_paint', 2), ('drivetrain', 2), ('steering', 2), ('accessories-interior', 1), ('brakes', 1), ('exhaust_system', 1), ('fuel_system', 1), ('suspension', 1), ('wheels_hubs', 1)])
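
Raw counts are easier to compare across vehicles when expressed as a share of the total. A sketch using the 2001 Nissan Altima output above; `systemShares` is a hypothetical helper name.

```python
from collections import OrderedDict

def systemShares(counts):
    """Return an OrderedDict of system -> fraction of all complaints, in input order."""
    total = sum(counts.values())
    if total == 0:
        return OrderedDict((system, 0.0) for system in counts)
    return OrderedDict((system, qty / total) for system, qty in counts.items())

# Values copied from getTopSystemsQty(year=2001, make='Nissan', model='Altima')
altima_2001 = OrderedDict([
    ('engine', 144), ('electrical', 6), ('windows_windshield', 6),
    ('cooling_system', 3), ('transmission', 3), ('AC_heater', 2),
    ('body_paint', 2), ('drivetrain', 2), ('steering', 2),
    ('accessories-interior', 1), ('brakes', 1), ('exhaust_system', 1),
    ('fuel_system', 1), ('suspension', 1), ('wheels_hubs', 1)])

shares = systemShares(altima_2001)
print('{:.0%}'.format(shares['engine']))  # 82%
```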

#### Function to get the qty of NHTSA complaints for each system category

In [297]:
from bs4 import BeautifulSoup
from collections import OrderedDict
import urllib.request as request
import re


def getNhtsaSystemsQty(make, model, year):
    """Function that returns an OrderedDict containing qty of NHTSA complaints by system"""
    
    url = 'http://www.carcomplaints.com/'+make+'/'+model+'/'+str(year)+'/'
    html = request.urlopen(url)

    soup = BeautifulSoup(html, 'html.parser')

    nhtsa = soup.find_all('em', class_='nhtsa')

    nhtsa_counts = []
    for item in nhtsa:
        tokens = item.span.get_text().split()
        try:
            # Usually there are 3 whitespace-separated tokens and the 3rd is the qty
            nhtsa_counts.append(int(tokens[2]))
        except (IndexError, ValueError):
            # Unfortunately, some entries have only 2 tokens; the qty is then the 2nd
            nhtsa_counts.append(int(tokens[1]))

    # Keep the <li> rows whose id contains "bar"
    systems = soup.find_all('li', id=re.compile('bar'))

    # Remove the ending forward slash from each system link
    systems_list = [item.a['href'][:-1] for item in systems]

    # Pair systems with counts by position, preserving page order
    return OrderedDict(zip(systems_list, nhtsa_counts))

In [298]:
getNhtsaSystemsQty('Honda','Accord','2001')

OrderedDict([('transmission', 129), ('seat_belts_air_bags', 327), ('engine', 76), ('body_paint', 8), ('electrical', 49), ('accessories-interior', 41), ('AC_heater', 1), ('brakes', 57), ('exhaust_system', 5), ('windows_windshield', 10), ('cooling_system', 1), ('drivetrain', 53), ('lights', 5), ('suspension', 20), ('fuel_system', 16), ('steering', 11), ('wheels_hubs', 24), ('miscellaneous', 6), ('accessories-exterior', 4), ('clutch', 1)])
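
The two-or-three-token handling in `getNhtsaSystemsQty` could also be expressed as "take the last number in the text". The sketch below shows that alternative and is not the method used in this notebook. `lastNumber` is a hypothetical helper, and the sample strings are invented stand-ins, since the actual label text on the page is not shown here.

```python
import re

def lastNumber(text):
    """Return the last integer found in text (commas allowed), or None if there is none."""
    matches = re.findall(r'\d[\d,]*', text)
    if not matches:
        return None
    return int(matches[-1].replace(',', ''))

# Invented examples with three and two whitespace-separated tokens
three_tokens = lastNumber('NHTSA complaints: 129')
two_tokens = lastNumber('NHTSA: 1,327')
no_number = lastNumber('no count here')

print(three_tokens)  # 129
print(two_tokens)    # 1327
print(no_number)     # None
```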

#### Function to get qty of complaints by sub-system

In [251]:
from bs4 import BeautifulSoup
from collections import OrderedDict
import urllib.request as request
import re

make = 'Honda'
model = 'Civic'
year = 2001
system = 'transmission'

def getSubSystemsQty(make, model, year, system):
    """Function that will return an OrderedDict of # of complaints by sub-system"""
    
    url = 'http://www.carcomplaints.com/'+make+'/'+model+'/'+str(year)+'/'+system+'/'
    html = request.urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')

    # Keep the <li> rows whose id contains "bar"
    li = soup.find_all('li', id=re.compile('bar'))

    subsystem_counts_dict = OrderedDict()  # We want to maintain insertion order
    for item in li:
        subsystem_counts_dict[item.a['href'].split(".")[0]]=int(item.span.get_text().replace(",",""))
        
    return subsystem_counts_dict

In [316]:
getSubSystemsQty('Honda','Accord','2000','transmission')

OrderedDict([('transmission_slipping', 96), ('transmission_failure', 83), ('jerks_in_gear', 27), ('clunking_between_gears_when_shifting', 11), ('loss_of_power', 9), ('loud_noise_from_transmission', 7), ('transmission_fluid_leak', 2), ('not_shifting_properly', 1), ('power_train-automatic_transmission', 130), ('power_train-automatic_transmission-torque_converter', 4), ('power_train-automatic_transmission-cooling_unit_and_lines', 2), ('power_train-automatic_transmission-lever_and_linkage-floor_shift', 2), ('power_train-automatic_transmission-lever_and_linkage-column_shift', 1), ('power_train-manual_transmission', 1)])
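
The keys returned by `getSubSystemsQty` are the link targets with everything from the first `.` onward removed, so they can be filtered like any other dict keys before deciding which pages to look at next. A sketch, assuming a minimum complaint count is a sensible filter; `subsystemsAtLeast` and the threshold of 20 are illustrative.

```python
from collections import OrderedDict

def subsystemsAtLeast(counts, minimum):
    """Return the sub-system keys whose complaint count is >= minimum, in input order."""
    return [name for name, qty in counts.items() if qty >= minimum]

# A subset of the getSubSystemsQty('Honda','Accord','2000','transmission') output above
accord_2000 = OrderedDict([
    ('transmission_slipping', 96), ('transmission_failure', 83),
    ('jerks_in_gear', 27), ('clunking_between_gears_when_shifting', 11),
    ('loss_of_power', 9), ('loud_noise_from_transmission', 7)])

frequent = subsystemsAtLeast(accord_2000, 20)
print(frequent)  # ['transmission_slipping', 'transmission_failure', 'jerks_in_gear']
```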

#### Function to get the review text for a specific system failure

In [317]:
from bs4 import BeautifulSoup
import urllib.request as request
import re

def getReviews(make, model, year, system, subsystem):
    """Function that returns a list of customer reviews"""
    
    url = 'http://www.carcomplaints.com/'+make+'/'+model+'/'+str(year)+'/'+system+'/'+subsystem+'.shtml'
    html = request.urlopen(url)
    soup = BeautifulSoup(html, 'html.parser')

    reviews = soup.find_all('div', itemprop="reviewBody")
    
    complaints = []
    for complaint in reviews:
        complaints.append(complaint.p.get_text())
        
    return complaints

In [318]:
for complaint in getReviews('Honda','Civic','2001','transmission','pops_out_of_gear'):
    print(complaint)
    print('*'*120)

This is the second time that the speed sensor has gone out in the last nine months.
************************************************************************************************************************
Syncros are bad, pops out of gear and doesn't get back in.. throw out bearing was bad too
************************************************************************************************************************
The transmission of my 2001 civic began popping out of 4th gear on occasion. The car was my wifes and I didn't normally drive it. When I did and if popped out during acceleration she told me it had been doing it for a while. We decided to get a new car since we saw a trend since we just put in new struts and had the anti-lock break system replaced for about $1000 each . The Honda dealership test drove it and said it needed a new transmission. This had a large impact on the trade in value. I need to keep an eye on this site and when I see if our 2010 civic might have issues and 