In [1]:
import pandas as pd
import re
import datetime
import itertools

# A. Text Cleaning

In [2]:
comments = pd.read_csv('forum_comments_2022.csv')
comments.head()

Unnamed: 0,date,user_id,message
0,2022-01-01 03:53:00,benjaminh,\nAlthough the acceleration times in most car ...
1,2022-01-01 03:57:00,tjc78,\n\n@stickguy said:\nXM only comes on it if yo...
2,2022-01-01 04:08:00,tjc78,\n\n@qbrozen said:\nI don’t have the room for ...
3,2022-01-01 04:48:00,graphicguy,\nCongrats @stickguy ……very cool!
4,2022-01-01 06:03:00,au1994,\nHappy New Year all!Congrats @stickguy. I rea...


In [3]:
comments['message'][0]

'\nAlthough the acceleration times in most car magazines are somewhat bogus, they are still useful for comparison. Anyway, Car and Driver has just tested a Maverick 2.0 awd, and they got a 0-60 time of 5.9, which is not that far from a base BMW 330i at 5.6. Anyway, for the money the Maverick is a fast vehicle.https://www.caranddriver.com/reviews/a38516737/2022-ford-maverick-xlt-fx4-by-the-numbers/stickguy: Did you end up going for the 0% financing? If so, those payments of c. 750 a month are steep, but the equity will obviously build up very quickly. Even right now you could probably sell it for several thousand more than what you bought it for. The latest I heard on the chip shortage is that it will only slowly abate in 2022. According to one article I saw, things aren\'t likely to truly get back to "normal," whatever that is, probably until the second half of 2023. If true, this will likely mean that there won\'t be a sudden collapse of car prices. With all the lost production and pe

### Remove username in the message

In [4]:
def remove_uid(text):
    return re.sub(r'@\S+', '', text)

In [5]:
comments['cleaned'] = [remove_uid(w) for w in comments['message']]
comments['cleaned'][0]

'\nAlthough the acceleration times in most car magazines are somewhat bogus, they are still useful for comparison. Anyway, Car and Driver has just tested a Maverick 2.0 awd, and they got a 0-60 time of 5.9, which is not that far from a base BMW 330i at 5.6. Anyway, for the money the Maverick is a fast vehicle.https://www.caranddriver.com/reviews/a38516737/2022-ford-maverick-xlt-fx4-by-the-numbers/stickguy: Did you end up going for the 0% financing? If so, those payments of c. 750 a month are steep, but the equity will obviously build up very quickly. Even right now you could probably sell it for several thousand more than what you bought it for. The latest I heard on the chip shortage is that it will only slowly abate in 2022. According to one article I saw, things aren\'t likely to truly get back to "normal," whatever that is, probably until the second half of 2023. If true, this will likely mean that there won\'t be a sudden collapse of car prices. With all the lost production and pe
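
The first message contains no @-mention, so the output above is unchanged. As a minimal sketch of what the pattern does when a mention is present, the snippet below runs it on a made-up sentence (not taken from the dataset) and compares it with a hypothetical narrower pattern, `@\w+`, that leaves trailing punctuation in place.

```python
import re

def remove_uid(text):
    # Same pattern as the cell above: '@' followed by any run of non-whitespace
    return re.sub(r'@\S+', '', text)

def remove_uid_word_only(text):
    # Hypothetical variant: strips only word characters after '@',
    # so trailing punctuation such as '.' or ':' is kept
    return re.sub(r'@\w+', '', text)

# Made-up sample sentence, for illustration only
sample = "Happy New Year all! Congrats @stickguy. I really like it"

greedy = remove_uid(sample)
narrow = remove_uid_word_only(sample)

print(repr(greedy))
print(repr(narrow))
```

Either pattern works for this pipeline because punctuation is stripped in a later step; the difference only matters if sentence boundaries need to be preserved.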

### Convert all text to lower case

In [6]:
comments['cleaned'] = comments['cleaned'].str.lower()
comments['cleaned'][0]

'\nalthough the acceleration times in most car magazines are somewhat bogus, they are still useful for comparison. anyway, car and driver has just tested a maverick 2.0 awd, and they got a 0-60 time of 5.9, which is not that far from a base bmw 330i at 5.6. anyway, for the money the maverick is a fast vehicle.https://www.caranddriver.com/reviews/a38516737/2022-ford-maverick-xlt-fx4-by-the-numbers/stickguy: did you end up going for the 0% financing? if so, those payments of c. 750 a month are steep, but the equity will obviously build up very quickly. even right now you could probably sell it for several thousand more than what you bought it for. the latest i heard on the chip shortage is that it will only slowly abate in 2022. according to one article i saw, things aren\'t likely to truly get back to "normal," whatever that is, probably until the second half of 2023. if true, this will likely mean that there won\'t be a sudden collapse of car prices. with all the lost production and pe

### Remove URLs

In [7]:
import re

def remove_url(text):
    return re.sub(r'http\S+', '', text)

In [8]:
comments['cleaned'] = [remove_url(w) for w in comments['cleaned']]
comments['cleaned'][0]

'\nalthough the acceleration times in most car magazines are somewhat bogus, they are still useful for comparison. anyway, car and driver has just tested a maverick 2.0 awd, and they got a 0-60 time of 5.9, which is not that far from a base bmw 330i at 5.6. anyway, for the money the maverick is a fast vehicle. did you end up going for the 0% financing? if so, those payments of c. 750 a month are steep, but the equity will obviously build up very quickly. even right now you could probably sell it for several thousand more than what you bought it for. the latest i heard on the chip shortage is that it will only slowly abate in 2022. according to one article i saw, things aren\'t likely to truly get back to "normal," whatever that is, probably until the second half of 2023. if true, this will likely mean that there won\'t be a sudden collapse of car prices. with all the lost production and pent up demand we\'ve got, used and new vehicle prices are likely to remain high for the first half 

### Remove numbers

In [9]:
import re 
order = r'[0-9]'

# text as the row
def remove_numbers(text):
    filtered_text = re.sub(order,'',text)
    return filtered_text


In [10]:
comments['cleaned'] = [remove_numbers(w) for w in comments['cleaned']]
comments['cleaned'][0]

'\nalthough the acceleration times in most car magazines are somewhat bogus, they are still useful for comparison. anyway, car and driver has just tested a maverick . awd, and they got a - time of ., which is not that far from a base bmw i at .. anyway, for the money the maverick is a fast vehicle. did you end up going for the % financing? if so, those payments of c.  a month are steep, but the equity will obviously build up very quickly. even right now you could probably sell it for several thousand more than what you bought it for. the latest i heard on the chip shortage is that it will only slowly abate in . according to one article i saw, things aren\'t likely to truly get back to "normal," whatever that is, probably until the second half of . if true, this will likely mean that there won\'t be a sudden collapse of car prices. with all the lost production and pent up demand we\'ve got, used and new vehicle prices are likely to remain high for the first half of , and then only slowl

### Remove punctuation

Replace punctuation with whitespace rather than deleting it, so that words separated only by punctuation are not joined together.

In [11]:
import string

def remove_punctuation(text):
    filtered_text = text.translate(str.maketrans(string.punctuation," "*len(string.punctuation)))
    return filtered_text


In [12]:
comments['cleaned'] = [remove_punctuation(w) for w in comments['cleaned']]
comments['cleaned'][0]

'\nalthough the acceleration times in most car magazines are somewhat bogus  they are still useful for comparison  anyway  car and driver has just tested a maverick   awd  and they got a   time of    which is not that far from a base bmw i at    anyway  for the money the maverick is a fast vehicle  did you end up going for the   financing  if so  those payments of c   a month are steep  but the equity will obviously build up very quickly  even right now you could probably sell it for several thousand more than what you bought it for  the latest i heard on the chip shortage is that it will only slowly abate in   according to one article i saw  things aren t likely to truly get back to  normal   whatever that is  probably until the second half of   if true  this will likely mean that there won t be a sudden collapse of car prices  with all the lost production and pent up demand we ve got  used and new vehicle prices are likely to remain high for the first half of   and then only slowly s
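
Note that `string.punctuation` covers ASCII punctuation only, which is why tokens such as `……very` and `don’t` still appear in later outputs. The sketch below illustrates this on a made-up string and shows one possible, hypothetical variant based on `unicodedata.category` that also catches non-ASCII punctuation. It is not part of the original pipeline.

```python
import string
import unicodedata

def remove_punctuation_ascii(text):
    # Same idea as the cell above: map each ASCII punctuation mark to a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)

def remove_punctuation_unicode(text):
    # Hypothetical variant: replace every character whose Unicode category
    # starts with 'P' (punctuation) with a space. Symbols such as '$' or '+'
    # are category 'S', so this would complement the ASCII step, not replace it.
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )

# Made-up sample containing a curly apostrophe and Unicode ellipses
sample = "congrats ……very cool! i don’t mind"

ascii_only = remove_punctuation_ascii(sample)
unicode_aware = remove_punctuation_unicode(sample)

print(ascii_only.split())
print(unicode_aware.split())
```

Whether curly-apostrophe contractions should be split this way is a design choice; the variant is shown only to make the behavior of the ASCII-only step visible.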

### Tokenize: breaking sentences into words

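Use TreebankWordTokenizer, which handles contractions by rule (for example, it splits "don't" into "do" and "n't") instead of breaking on every apostrophe. Because ASCII apostrophes were already replaced with spaces in the previous step, only contractions typed with a curly apostrophe, such as "don’t", survive as single tokens here.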

In [13]:
import nltk
from nltk import word_tokenize
from nltk.tokenize import TreebankWordTokenizer

def tokenize_word(text):
    return TreebankWordTokenizer().tokenize(text)

In [14]:
comments['tokenized'] = [tokenize_word(w) for w in comments['cleaned']]

comments['tokenized'].head()

0    [although, the, acceleration, times, in, most,...
1    [said, xm, only, comes, on, it, if, you, add, ...
2    [said, i, don’t, have, the, room, for, it, but...
3                             [congrats, ……very, cool]
4    [happy, new, year, all, congrats, i, really, r...
Name: tokenized, dtype: object

### Remove stop words

In [15]:
import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # a set gives fast membership tests

# Query as the row 
def remove_stopwords(query):
    result = [word for word in query if word not in stop_words]
    return result

In [16]:
comments['cleaned'] = [remove_stopwords(w) for w in comments['tokenized']]
comments['cleaned'].head()

0    [although, acceleration, times, car, magazines...
1    [said, xm, comes, add, lariat, luxury, package...
2    [said, don’t, room, someone, save, poor, thing...
3                             [congrats, ……very, cool]
4    [happy, new, year, congrats, really, really, l...
Name: cleaned, dtype: object

# B. Find the top-5 brands 

Find the most frequently mentioned brands in the forum messages by counting how many posts mention each one.

For each brand, the mention is counted only once per post.
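
The once-per-post rule can be sketched as follows. This is a minimal illustration on made-up token lists, assuming each post is already a list of lowercase tokens; the function name `count_brand_mentions` is hypothetical.

```python
from collections import Counter

def count_brand_mentions(posts, brands):
    """Count, for each brand, the number of posts that mention it at least once."""
    brand_set = set(brands)  # set lookup instead of scanning a list
    counts = Counter()
    for tokens in posts:
        # set(tokens) collapses repeated mentions within a single post
        counts.update(set(tokens) & brand_set)
    return counts

# Made-up token lists standing in for comments['cleaned']
posts = [
    ["bmw", "fast", "bmw", "ford"],
    ["ford", "truck"],
    ["happy", "new", "year"],
]

counts = count_brand_mentions(posts, ["ford", "bmw", "toyota"])
print(counts.most_common())
```

Here "bmw" appears twice in the first post but contributes only one count, and `Counter.most_common()` returns the brands already sorted by frequency.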

In [17]:
brands = pd.read_csv('car_companies.csv')
brands.head()

Unnamed: 0,Make
0,SNVI
1,Zanella
2,Koller
3,Anasagasti
4,AutoLatina


In [18]:
# Extract the brand names
brands = brands['Make'].tolist()

# Convert to lowercase
brands = list(map(str.lower,brands))

brands[:5]

['snvi', 'zanella', 'koller', 'anasagasti', 'autolatina']

In [19]:
freqDict = dict(zip(brands, [0]*len(brands)))

# For each message
for i in range(len(comments)):
    # For every word in the message
    text = set(comments['cleaned'][i])
    for w in text:
        if w in brands:
            freqDict[w] += 1

In [20]:
newDict = dict((k, v) for k, v in freqDict.items() if v > 0)
print(newDict)

{'holden': 1, 'steyr': 1, 'sin': 1, 'beaumont': 1, 'brooks': 4, 'dennis': 1, 'derby': 2, 'dynasty': 1, 'monarch': 1, 'passport': 5, 'russell': 8, 'aero': 5, 'alpine': 8, 'bugatti': 14, 'peugeot': 10, 'renault': 3, 'alpina': 2, 'apollo': 1, 'audi': 260, 'bitter': 5, 'bmw': 461, 'fuso': 4, 'man': 171, 'opel': 46, 'porsche': 59, 'smart': 56, 'volkswagen': 15, 'borgward': 1, 'nag': 2, 'bet': 109, 'force': 25, 'tvs': 11, 'hero': 2, 'premier': 8, 'standard': 150, 'tmc': 2, 'abarth': 1, 'cts': 7, 'ducati': 3, 'ferrari': 18, 'fiat': 42, 'iso': 1, 'lancia': 2, 'lamborghini': 11, 'maserati': 14, 'zagato': 3, 'bertone': 4, 'fca': 3, 'rapid': 4, 'acura': 203, 'daihatsu': 3, 'dome': 7, 'honda': 271, 'infiniti': 87, 'isuzu': 12, 'kawasaki': 3, 'lexus': 84, 'mazda': 97, 'nissan': 163, 'subaru': 136, 'suzuki': 26, 'toyota': 317, 'yamaha': 3, 'datsun': 5, 'eunos': 1, 'stellantis': 12, 'buddy': 54, 'think': 2068, 'delta': 24, 'star': 19, 'umm': 6, 'yugo': 4, 'genesis': 42, 'hyundai': 195, 'kia': 126, 'd

In [21]:
sorted_brands = sorted(newDict.items(), key=lambda x:x[1], reverse=True)
sorted_dict = dict(sorted_brands)

print(sorted_dict)

{'think': 2068, 'ford': 575, 'local': 481, 'bmw': 461, 'jeep': 347, 'toyota': 317, 'white': 284, 'honda': 271, 'audi': 260, 'seat': 254, 'mini': 226, 'tesla': 215, 'acura': 203, 'hyundai': 195, 'man': 171, 'nissan': 163, 'standard': 150, 'subaru': 136, 'kia': 126, 'gm': 116, 'bet': 109, 'pilot': 102, 'mazda': 97, 'rivian': 97, 'ram': 95, 'infiniti': 87, 'lexus': 84, 'cadillac': 66, 'ac': 63, 'polestar': 60, 'chrysler': 60, 'porsche': 59, 'smart': 56, 'buddy': 54, 'lincoln': 54, 'google': 53, 'national': 47, 'opel': 46, 'buick': 46, 'dodge': 45, 'fiat': 42, 'genesis': 42, 'rover': 41, 'austin': 36, 'continental': 34, 'saturn': 30, 'suzuki': 26, 'force': 25, 'delta': 24, 'king': 24, 'pontiac': 24, 'moon': 23, 'cutting': 21, 'chevrolet': 20, 'star': 19, 'oldsmobile': 19, 'ferrari': 18, 'saab': 16, 'volkswagen': 15, 'bugatti': 14, 'maserati': 14, 'gmc': 14, 'micro': 13, 'bentley': 13, 'isuzu': 12, 'stellantis': 12, 'tvs': 11, 'lamborghini': 11, 'cord': 11, 'plymouth': 11, 'peugeot': 10, 'j

"THINK", "LOCAL", and "WHITE" are common words in everyday sentences, so it is highly likely that they were misclassified as car brands.

Additionally, "TH!NK" filed for bankruptcy in 2011, "LOCAL" shut down its factory at the beginning of 2022, and "WHITE" is an old brand from the 1980s.

It is safe to say people were not referring to the brands "TH!NK", "LOCAL", or "WHITE" when they used these three words.

Remove words that are common in everyday sentences and most likely do not refer to a brand, for example because the manufacturer is already defunct.
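
A minimal sketch of this removal step, using a dictionary comprehension so that a word missing from the counts does not raise a `KeyError`. The counts are copied from the sorted output above; the variable names are illustrative.

```python
# A few counts copied from the sorted output above
sorted_counts = {'think': 2068, 'ford': 575, 'local': 481, 'bmw': 461}

# Ambiguous everyday words; 'white' is deliberately absent from the small sample
ambiguous = {'think', 'local', 'white'}

# Keep only entries whose key is not an ambiguous word (order is preserved)
filtered = {brand: n for brand, n in sorted_counts.items() if brand not in ambiguous}

print(filtered)
```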

In [22]:
unwanted = ['think','local','white','seat','mini','standard','bet','buddy','man','smart','national','cutting']

for w in unwanted:
    del sorted_dict[w]

#sorted_dict

In [23]:
df = pd.DataFrame.from_dict(sorted_dict, orient='index', columns=['freq'])
df.index.name = 'brand'
df.head(10)

brand,freq
ford,575
bmw,461
jeep,347
toyota,317
honda,271
audi,260
tesla,215
acura,203
hyundai,195
nissan,163


The actual top-5 brands should be <b>Ford, BMW, Jeep, Toyota, and Honda</b>.

In [24]:
df.to_csv('mentioned_brands.csv')

# C. Identify top-3 co-mentioned brands

Calculate the frequency of co-mentions of brands.

For example, if Honda and Toyota are mentioned in the same post, then the co-mention frequency of Honda and Toyota increases by 1.
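
The co-mention rule can be sketched as below, assuming each post is a list of lowercase tokens. The posts are made up and the function name `count_comentions` is hypothetical. Sorting the brands within each post ensures that (honda, toyota) and (toyota, honda) are counted under the same key.

```python
from collections import Counter
from itertools import combinations

def count_comentions(posts, brands):
    """Count how many posts mention each unordered pair of brands."""
    brand_set = set(brands)
    pair_counts = Counter()
    for tokens in posts:
        # Unique brands in this post, sorted so each pair has one canonical order
        mentioned = sorted(set(tokens) & brand_set)
        pair_counts.update(combinations(mentioned, 2))
    return pair_counts

# Made-up token lists standing in for comments['cleaned']
posts = [
    ["honda", "toyota", "reliable", "honda"],
    ["toyota", "honda", "acura"],
    ["ford", "truck"],
]

pairs = count_comentions(posts, ["honda", "toyota", "acura", "ford"])
print(pairs.most_common(3))
```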

In [25]:
mentioned_brands = pd.read_csv('mentioned_brands.csv')
mentioned_brands

Unnamed: 0,brand,freq
0,ford,575
1,bmw,461
2,jeep,347
3,toyota,317
4,honda,271
...,...,...
143,hudson,1
144,hummer,1
145,kaiser,1
146,nash,1


For simplicity, only select brands that were mentioned at least 20 times.

We cannot gain much insight from brands that were mentioned only a few times.

In [26]:
mentioned_brands = mentioned_brands[mentioned_brands['freq'] >= 20]
mentioned_brands.shape

(42, 2)

In [27]:
# Extract the mentioned brand names
brands = mentioned_brands['brand'].tolist()
brands[:5]

['ford', 'bmw', 'jeep', 'toyota', 'honda']

In [28]:
# Create a dataframe with one row per message to store the brands mentioned in it
comention = pd.DataFrame(index=range(len(comments)),columns=['brands_mentioned'])
# One independent empty list per row (pd.np is deprecated and gone from recent pandas)
comention['brands_mentioned'] = [[] for _ in range(len(comention))]
comention.head()


Unnamed: 0,brands_mentioned
0,[]
1,[]
2,[]
3,[]
4,[]


In [29]:
# For each message
for i in range(len(comments)):
    text = comments['cleaned'][i]
    for w in text:
        if w in brands and w not in comention['brands_mentioned'][i]:
            comention['brands_mentioned'][i].append(w)

comention

Unnamed: 0,brands_mentioned
0,[bmw]
1,[]
2,[]
3,[]
4,[]
...,...
13498,[]
13499,[mazda]
13500,[]
13501,[]


In [60]:
comention_dict = {}

for i in range(len(comention)):
    brands_in_msg = comention['brands_mentioned'][i]
    brands_in_msg.sort()

    keys = list(itertools.combinations(brands_in_msg,2))
    for k in keys:
        if k in comention_dict:
            comention_dict[k] += 1
        else:
            comention_dict[k] = 1

sorted_comention = dict(sorted(comention_dict.items(), key=lambda x:x[1], reverse=True))

In [61]:
comention_df = pd.DataFrame.from_dict(sorted_comention, orient='index', columns=['freq'])
comention_df.head(3)

Unnamed: 0,freq
"(acura, honda)",66
"(honda, toyota)",62
"(hyundai, kia)",52


The top 3 brand pairs which get mentioned together the most are:
- Acura and Honda
- Honda and Toyota
- Hyundai and KIA

# D. Brand-Attribute Co-Frequency

Find the 5 most frequently mentioned attributes of cars in the discussions.

Identify which attributes are most strongly associated with the top 5 brands we found in part B. 

In [34]:
top_brands = ['ford', 'bmw', 'jeep', 'toyota', 'honda']

### Extract noun and adjective

In [35]:
import spacy
nlp = spacy.load('en_core_web_sm')

def extract_noun_adj(query):
    result = []
    doc = nlp(" ".join(query))
    
    result = [w for w in doc if w.pos_ == "NOUN" or w.pos_ == "ADJ"]
    return result

In [36]:
comments['extracted'] = [extract_noun_adj(w) for w in comments['cleaned']]
comments['extracted'].head()

0    [acceleration, times, car, magazines, bogus, u...
1    [package, navi, option, fine, assume, phone, a...
2    [poor, thing, wrong, thrust, obvious, stuff, s...
3                                     [congrats, cool]
4                         [happy, new, year, congrats]
Name: extracted, dtype: object

### Lemmatize all the tokens

In [37]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_text(query):
    result = []

    for ele in query:
        result.append(lemmatizer.lemmatize(ele.text))
    
    return result

In [38]:
comments['extracted'] = [lemmatize_text(w) for w in comments['extracted']]
comments['extracted'].head()

0    [acceleration, time, car, magazine, bogus, use...
1    [package, navi, option, fine, assume, phone, a...
2    [poor, thing, wrong, thrust, obvious, stuff, s...
3                                     [congrats, cool]
4                         [happy, new, year, congrats]
Name: extracted, dtype: object

### Count term frequency

In [39]:
attribute_dict = {}

for i in range(len(comments)):
    terms_in_msg = comments['extracted'][i]
    terms_in_msg.sort()

    for k in terms_in_msg:
        if len(k) > 1:
            if k in attribute_dict:
                attribute_dict[k] += 1
            else:
                attribute_dict[k] = 1
    
attribute_dict = dict(sorted(attribute_dict.items(), key=lambda x:x[1], reverse=True))

In [40]:
attr_df = pd.DataFrame.from_dict(attribute_dict, orient='index', columns=['freq'])
attr_df.index.name = 'attributes'
attr_df.head(50)

attributes,freq
car,6327
year,3129
time,2842
new,2614
good,2040
dealer,2009
mile,1902
day,1833
lot,1648
price,1624


In [41]:
attr_df.to_csv('attr_freq.csv')

Given the word frequency ranking result, these five terms are the most frequently mentioned car attributes:

- mile, price, tire, engine, big

In [42]:
freq_car_attributes = ['mile', 'price', 'tire', 'engine', 'big']

### Count brand-attribute co-occurrence frequency

In [43]:
# Create an empty dataframe with attributes * top_brands
brand_attr_df = pd.DataFrame(0, columns=top_brands, index=freq_car_attributes)
brand_attr_df.head()

for i in range(len(comments)):
    msg = comments['cleaned'][i]

    brands = []
    attrs = []
    for m in msg:
        if m in top_brands and m not in brands:
            brands.append(m)
        if m in freq_car_attributes and m not in attrs:
            attrs.append(m)

    for a in attrs:
        for b in brands:
            brand_attr_df.loc[a, b] += 1  # single .loc call avoids chained assignment


In [44]:
brand_attr_df['sum'] = brand_attr_df.sum(axis=1)
brand_attr_df.sort_values(by='sum', ascending=False)

Unnamed: 0,ford,bmw,jeep,toyota,honda,sum
price,64,92,27,41,53,277
engine,54,56,25,43,22,200
big,63,31,12,34,28,168
mile,20,25,11,13,18,87
tire,2,9,4,4,14,33


Across the five brands combined, 'price' is the most strongly associated attribute. It is also the top attribute for Ford, BMW, Jeep, and Honda individually.

Ford is mentioned with 'big' almost as often as with 'price' (63 vs. 64 posts), while Toyota is most frequently mentioned with 'engine'.
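
As a quick check, the strongest attribute for each brand can be read directly off the table above. The sketch below copies the counts into a plain dictionary (the names `brand_attr` and `top_attr` are illustrative) and picks the attribute with the largest count per brand.

```python
# Counts copied from the brand-attribute table above
brand_attr = {
    'price':  {'ford': 64, 'bmw': 92, 'jeep': 27, 'toyota': 41, 'honda': 53},
    'engine': {'ford': 54, 'bmw': 56, 'jeep': 25, 'toyota': 43, 'honda': 22},
    'big':    {'ford': 63, 'bmw': 31, 'jeep': 12, 'toyota': 34, 'honda': 28},
    'mile':   {'ford': 20, 'bmw': 25, 'jeep': 11, 'toyota': 13, 'honda': 18},
    'tire':   {'ford': 2,  'bmw': 9,  'jeep': 4,  'toyota': 4,  'honda': 14},
}
brands = ['ford', 'bmw', 'jeep', 'toyota', 'honda']

# For each brand, pick the attribute with the highest co-occurrence count
top_attr = {b: max(brand_attr, key=lambda a: brand_attr[a][b]) for b in brands}

for brand, attr in top_attr.items():
    print(brand, attr)
```

With the DataFrame itself, `brand_attr_df[top_brands].idxmax()` should give the same per-brand result.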

# E. Identify the most aspirational brand

To identify which brands are the most aspirational, I first extracted all the verbs, lemmatized them to present tense, and calculated the frequency count for each verb.

From the 200 most frequently used verbs, I selected 10 that people commonly use to express a desire to buy or own a car.

### Extract Verbs

In [45]:
import spacy
nlp = spacy.load('en_core_web_sm')

def extract_verb(query):
    result = []
    doc = nlp(" ".join(query))
    
    result = [w for w in doc if w.pos_ == "VERB"]
    return result

In [46]:
comments['aspired'] = [extract_verb(w) for w in comments['cleaned']]
comments['aspired'].head()

0    [tested, got, base, end, going, build, sell, b...
1    [said, comes, add, use, spend, playing, used, ...
2       [said, room, save, fix, concerned, see, agree]
3                                                   []
4                                                   []
Name: aspired, dtype: object

### Lemmatize all the verbs to present tense

In [47]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_verb(query):
    result = []

    for ele in query:
        result.append(lemmatizer.lemmatize(ele.text,'v'))
    
    return result

In [48]:
comments['aspired'] = [lemmatize_verb(w) for w in comments['aspired']]
comments['aspired'].head()

0    [test, get, base, end, go, build, sell, buy, h...
1    [say, come, add, use, spend, play, use, figure...
2          [say, room, save, fix, concern, see, agree]
3                                                   []
4                                                   []
Name: aspired, dtype: object

### Count verb frequency

In [49]:
verb_dict = {}

for i in range(len(comments)):
    terms_in_msg = comments['aspired'][i]
    terms_in_msg.sort()

    for k in terms_in_msg:
        if len(k) > 1:
            if k in verb_dict:
                verb_dict[k] += 1
            else:
                verb_dict[k] = 1
    
verb_dict = dict(sorted(verb_dict.items(), key=lambda x:x[1], reverse=True))

In [50]:
verb_df = pd.DataFrame.from_dict(verb_dict, orient='index', columns=['freq'])
verb_df.index.name = 'verbs'
verb_df.head(10)

verbs,freq
say,12305
get,6129
go,4409
think,3203
look,2658
make,2586
take,2345
drive,2188
see,2081
know,2042


In [51]:
verb_df.to_csv('verb_freq.csv')

From the 200 most frequent verbs, I selected 10 common verbs that might express people's desire to buy or own a car.

In [52]:
aspirational = ['want','buy','need','pay','pick',
                'like','love','own','enjoy','prefer']


### Count brand-desire co-occurrence frequency

In [53]:
# Select only top brands
mentioned_brands = pd.read_csv("mentioned_brands.csv")

most_mentioned_brands = mentioned_brands[mentioned_brands['freq'] >= 100]
most_mentioned_brands.shape

most_mentioned_brands = most_mentioned_brands['brand'].tolist()

In [54]:
# Create an empty dataframe with desire-verb * brands
brand_desire_df = pd.DataFrame(0, columns=most_mentioned_brands, index=aspirational)
brand_desire_df.head()

for i in range(len(comments)):
    msg = comments['cleaned'][i]
    verbs = comments['aspired'][i]

    brands = []
    desires = []
    for m in msg:
        if m in most_mentioned_brands and m not in brands:
            brands.append(m)
    for v in verbs:
        if v in aspirational and v not in desires:
            desires.append(v)

    for a in desires:
        for b in brands:
            brand_desire_df.loc[a, b] += 1  # single .loc call avoids chained assignment


In [55]:
brand_desire_df

Unnamed: 0,ford,bmw,jeep,toyota,honda,audi,tesla,acura,hyundai,nissan,subaru,kia,gm,pilot
want,115,87,48,68,52,61,29,27,64,26,35,30,22,19
buy,128,90,33,50,45,63,21,28,49,42,24,43,34,20
need,80,56,66,48,31,60,31,12,29,24,14,20,21,14
pay,35,35,33,29,19,16,16,10,22,20,11,12,12,13
pick,30,30,22,29,10,13,7,11,15,5,6,8,9,10
like,29,35,19,24,15,20,8,13,21,10,8,8,7,12
love,23,26,12,9,8,12,10,5,26,14,8,8,4,1
own,15,37,20,16,14,19,16,5,10,7,7,3,5,1
enjoy,6,13,6,7,10,5,3,1,12,1,2,7,1,2
prefer,5,17,4,7,10,14,2,8,12,5,2,2,1,1


Based on the raw counts, Ford appears to be the most aspirational car brand among chronic buyers in 2022.

Another way to interpret the result is to normalize it: divide each co-occurrence count by the brand's overall mention frequency. This removes the effect of a brand's popularity and focuses on the proportion of posts about a brand that also express interest in buying it.
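
As a worked example of this normalization, the sketch below uses a few counts copied from the tables above (brand frequencies from part B, raw co-occurrence counts from the previous cell). The dictionary names are illustrative.

```python
# Brand mention frequencies and raw verb co-occurrence counts copied from above
brand_freq = {'ford': 575, 'hyundai': 195, 'kia': 126}
raw = {
    'want': {'ford': 115, 'hyundai': 64, 'kia': 30},
    'buy':  {'ford': 128, 'hyundai': 49, 'kia': 43},
}

# Share of a brand's posts that also contain the verb, rounded to 3 decimals
normalized = {
    verb: {brand: round(n / brand_freq[brand], 3) for brand, n in row.items()}
    for verb, row in raw.items()
}

for verb, row in normalized.items():
    print(verb, row)
```

Ford has the largest raw "want" count (115), but once divided by its 575 mentions its share is lower than Hyundai's 64 out of 195.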

In [56]:
# Normalize the count
for brand in most_mentioned_brands:
    # Look up the brand's total mention count as a scalar
    freq = mentioned_brands.loc[mentioned_brands['brand'] == brand, 'freq'].iloc[0]
    brand_desire_df[brand] = (brand_desire_df[brand] / freq).round(3)

brand_desire_df

Unnamed: 0,ford,bmw,jeep,toyota,honda,audi,tesla,acura,hyundai,nissan,subaru,kia,gm,pilot
want,0.2,0.189,0.138,0.215,0.192,0.235,0.135,0.133,0.328,0.16,0.257,0.238,0.19,0.186
buy,0.223,0.195,0.095,0.158,0.166,0.242,0.098,0.138,0.251,0.258,0.176,0.341,0.293,0.196
need,0.139,0.121,0.19,0.151,0.114,0.231,0.144,0.059,0.149,0.147,0.103,0.159,0.181,0.137
pay,0.061,0.076,0.095,0.091,0.07,0.062,0.074,0.049,0.113,0.123,0.081,0.095,0.103,0.127
pick,0.052,0.065,0.063,0.091,0.037,0.05,0.033,0.054,0.077,0.031,0.044,0.063,0.078,0.098
like,0.05,0.076,0.055,0.076,0.055,0.077,0.037,0.064,0.108,0.061,0.059,0.063,0.06,0.118
love,0.04,0.056,0.035,0.028,0.03,0.046,0.047,0.025,0.133,0.086,0.059,0.063,0.034,0.01
own,0.026,0.08,0.058,0.05,0.052,0.073,0.074,0.025,0.051,0.043,0.051,0.024,0.043,0.01
enjoy,0.01,0.028,0.017,0.022,0.037,0.019,0.014,0.005,0.062,0.006,0.015,0.056,0.009,0.02
prefer,0.009,0.037,0.012,0.022,0.037,0.054,0.009,0.039,0.062,0.031,0.015,0.016,0.009,0.01


# Conclusion

In 2022, Ford, BMW, Jeep, Toyota, and Honda were the brands people talked about most.

Acura and Honda have the highest co-mention count, at 66; Honda is the parent company of Acura. Honda and Toyota, the second pair, both come from Japan and have been well-known competitors in the automobile industry for years. Hyundai and KIA, the third pair, also come from the same country, which partly explains why people often discuss them together.

In terms of attributes, "price" is the factor people care about most, followed by "engine" and whether the car is "big".

Last, among all the manufacturers, Ford appeared to be the most popular brand in 2022 by raw counts. After normalizing by how often each brand is mentioned, people also show a solid aspiration to buy Hyundai, KIA, and Audi.