# **ACM Research Coding Challenge**
This project aims to answer the question "What colors of cars are most expensive?" To do this, the data from the cars_raw.csv file is first cleaned up. Next, the cars are categorized by color. Finally, the average price of each category, with its margin of error, is calculated and plotted to determine and analyze the most expensive car colors.

In [269]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# **Data Conversion**
I loaded the data from the cars_raw.csv file into a DataFrame using the pandas library.

In [270]:
chart = pd.read_csv('../input/carsforsale/cars_raw.csv')
chart.head()

# **Data Cleanup: Price**
First, I will find out the data type of the Price column in order to see if I can use it as it is. 

In [271]:
chart['Price'].unique()

Unfortunately, the values in the Price column have the object data type, meaning that they need to be converted to numeric values. The values are currently stored as strings because they include a dollar sign and commas (currency format). Moreover, some of the prices are not listed and appear as "Not Priced". This makes the data harder to analyze, but since the collection of prices is so large, removing the very small number of cars without prices should not affect the sample. Therefore, I first removed the rows without prices from the dataset and then converted the remaining strings into integers.

In [272]:
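# Drop unpriced listings; .copy() avoids pandas' SettingWithCopyWarning on the assignment below
chart = chart[chart['Price'] != 'Not Priced'].copy()
# Strip the currency formatting and parse the numbers without eval(), which would execute arbitrary strings
chart['Price'] = pd.to_numeric(chart['Price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False))
chart['Price'].head() #to confirm that the prices are now numbers and integer types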
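
The claim that only a very small number of cars lack a price can be checked directly before dropping them. The following is a minimal sketch on an invented four-row frame (the `sample` DataFrame and its values are assumptions for illustration, not rows from cars_raw.csv):

```python
import pandas as pd

# Toy stand-in for the real dataset; these values are invented for illustration.
sample = pd.DataFrame({"Price": ["$22,500", "Not Priced", "$31,000", "$18,250"]})

# Boolean mask of listings with no price.
unpriced = sample["Price"] == "Not Priced"

# The mean of a boolean mask is the fraction of True values.
share_unpriced = unpriced.mean()
count_unpriced = int(unpriced.sum())

print(f"{count_unpriced} of {len(sample)} rows ({share_unpriced:.1%}) have no price")
```

Running the same two lines against `chart['Price']` before the filter would show how much of the real sample is being discarded.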

# **Data Cleanup: Exterior Color**
The cars in the dataset have a wide variety of color names; however, many names are synonyms or similar colors (for example, "Granite Crystal" and "Modern Steel" are both shades of silver). Thus, to simplify data analysis, I printed a list of the unique colors and mapped most of them to a basic color. Any color I missed was placed in the "Other" category. Ideally, the "Other" category would contain only cars without color names, but realistically, it also contains cars with exotic color names.

In [273]:
chart['ExteriorColor'].unique()

In [274]:
chart['ExteriorColor']=[x.lower() for x in chart['ExteriorColor']]
exterior=chart['ExteriorColor'].tolist()
for i in range(len(chart['ExteriorColor'])):
    color=exterior[i]
    if 'red' in color or 'crimson' in color or 'ruby' in color or 'lava' in color or 'rosso' in color or 'snazzberry' in color :
        exterior[i]="Red"
    elif 'orange' in color or 'mango' in color or "arancio" in color:
        exterior[i]="Orange"
    elif 'yellow' in color or 'gold' in color:
        exterior[i]="Yellow"
    elif 'green' in color or 'moss' in color or 'olive' in color or 'lime' in color or 'jade' in color:
        exterior[i]="Green"
    elif 'blue' in color or 'blu' in color or 'aqua' in color:
        exterior[i]="Blue"
    elif 'brown' in color or 'bronze' in color or 'walnut' in color or 'coffee' in color or 'mocha' in color or 'copper' in color or 'autumn' in color:
        exterior[i]="Brown"
    elif 'beige' in color or 'sand' in color or 'desert' in color or 'tan' in color or "cashmere" in color or 'khaki' in color or 'stone' in color:
        exterior[i]="Beige"
    elif 'white' in color or 'blizzard' in color or 'snow' in color or 'pearl' in color or 'ivory' in color or 'frost' in color or 'quartz' in color or 'bianco' in color or 'iridium' in color:
        exterior[i]="White"
    elif 'black' in color or 'shadow' in color or 'ebony' in color or 'night' in color or 'caviar' in color or 'coal' in color or 'nero' in color:
        exterior[i]="Black"
    elif 'gray' in color or 'silver' in color or 'grey' in color or 'grigio' in color or 'nickel' in color or 'steel' in color or 'granite' in color or 'graphite' in color or 'slate' in color or 'gun' in color or 'magnet' in color or ('metal' in color and not ('metallic' in color)) or 'platinum' in color or 'smoke' in color or 'rhino' in color or 'cement' in color or 'pewter' in color:
        exterior[i]="Silver"
    else:
        exterior[i]="Other"
chart['ExteriorColor']=exterior
chart['ExteriorColor'].head() #to confirm that the code above worked
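
The same first-match-wins logic can also be written as a lookup table, which may be easier to extend than a long `if`/`elif` chain. This is only a sketch: `COLOR_KEYWORDS` and `categorize_color` are hypothetical names, and the table lists just a few of the keywords used above.

```python
# Hypothetical keyword table. Order matters: the first matching category wins,
# just as in an if/elif chain. Only a few keywords are listed per category.
COLOR_KEYWORDS = [
    ("Red", ["red", "crimson", "ruby"]),
    ("Blue", ["blue", "aqua"]),
    ("White", ["white", "snow", "pearl"]),
    ("Black", ["black", "ebony", "nero"]),
    ("Silver", ["gray", "grey", "silver", "steel", "granite"]),
]


def categorize_color(name):
    """Return the first basic color whose keyword appears in `name`, else 'Other'."""
    lowered = name.lower()
    for category, keywords in COLOR_KEYWORDS:
        if any(keyword in lowered for keyword in keywords):
            return category
    return "Other"


print(categorize_color("Granite Crystal"))
print(categorize_color("Modern Steel"))
print(categorize_color("Soul Red Crystal"))
print(categorize_color("Unknown"))
```

Substring matching of this kind can misfire on short keywords. For instance, "tan" is a substring of "titanium", so it is worth spot-checking a few categorized names by hand.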

# **Data Analysis**
To answer the question, the mean price and margin of error for each color need to be found. The mean can be determined using the .mean() method. To find the margin of error, the standard deviation of the prices for each color and the number of cars in each color need to be calculated first. Then the error margin can be determined. 
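
Written out, the 95% margin of error computed in the next cell for a color with $n$ cars and sample standard deviation $s$ is (the symbols $E$, $s$, $n$, and $\alpha$ are notation introduced here for illustration):

$$E = t_{1-\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}}, \qquad \alpha = 0.05$$

Here $t_{1-\alpha/2,\; n-1}$ is the critical value of the t-distribution with $n-1$ degrees of freedom, and the confidence interval for the mean price is the sample mean $\pm E$.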

In [275]:
meansList=[]
StdevList=[]
errorsList=[]
sizeList=[]
for color in chart["ExteriorColor"].unique():
    colorMean=chart[chart['ExteriorColor'] == color]['Price'].mean()
    colorStdev=chart[chart['ExteriorColor'] == color]['Price'].std()
    meansList.append(colorMean)
    StdevList.append(colorStdev)
    n=len(chart[chart['ExteriorColor'] == color])
    sizeList.append(n)
    errorMargin = colorStdev / np.sqrt(n) * stats.t.ppf(1-0.05/2, n - 1) #the stats.t function calculates the t-value for each color. I do not need to find the t-value from a chart independently before-hand.
    errorsList.append(errorMargin)
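
As an alternative to the explicit loop, the same statistics could be computed in one pass with `groupby`. This is a sketch on invented toy data, not the notebook's actual output; only the column names `ExteriorColor` and `Price` come from the dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data, invented for illustration; column names follow the notebook.
toy = pd.DataFrame({
    "ExteriorColor": ["Red", "Red", "Red", "Blue", "Blue", "Blue"],
    "Price": [10000, 12000, 14000, 20000, 20000, 26000],
})

# One pass over the groups: mean, sample standard deviation, and group size.
summary = toy.groupby("ExteriorColor")["Price"].agg(["mean", "std", "count"])

# 95% margin of error: t critical value times the standard error of the mean.
t_crit = stats.t.ppf(1 - 0.05 / 2, summary["count"] - 1)
summary["error"] = t_crit * summary["std"] / np.sqrt(summary["count"])

print(summary.sort_values("mean"))
```

Because the result is already a DataFrame indexed by color, it avoids keeping several parallel lists in the same order.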

# **Data Visualization**
First, the data calculated in the previous section needs to be grouped into one DataFrame. Since the question is about the average price, the data is then sorted by mean price. Next, I plotted the data in a bar chart to show the average price of each color (with its margin of error) and to show which colors were, on average, more or less expensive.

In [276]:
plotData = pd.DataFrame(dict(colors=chart["ExteriorColor"].unique(),size=sizeList,means=meansList,errors=errorsList))
plotData=plotData.sort_values('means')
print(plotData) #to confirm that the data is sorted

In [277]:
colorsList = ['Purple' if color == 'Other' else color for color in plotData['colors'].tolist()] #adds a color to the "Other" bar on chart
plt.figure(figsize=(10,6))
plt.bar("colors",'means',data=plotData[["means", "colors"]],color=colorsList,yerr=plotData["errors"].tolist(),edgecolor='black')
plt.title('Average Price of Different Car Colors (With 95% Confidence Interval)', fontsize=18)
plt.xlabel('Color', fontsize=15)
plt.ylabel('Price ($)', fontsize=15)
plt.show()

# **Conclusion**
Surprisingly, the "Other" category, which contains cars without color names or with exotic color names, is the most expensive. Ignoring "Other", the three most expensive exterior colors were black, orange, and white, and the three least expensive were brown, beige, and red. However, four important caveats apply.

First, the data contains both new and used cars. A used car's price is heavily affected not only by its color but also by its mileage, condition, and history. This somewhat undermines the comparison, since those factors are the same across new cars but differ among used cars.

Second, many of these differences are not statistically significant. As the margins of error show, many of the intervals overlap, which means that the colors could rank differently in another sample. For example, with a different sample of black and orange cars, the average for orange might be greater than the average for black.

Third, and most likely the cause of that overlap, the group sizes differ considerably. For example, there were only 24 yellow cars but about 3000 black cars in the sample. Sample size affects the precision of an estimate: larger samples tend to have a smaller margin of error (as seen in the chart) and greater precision.

Fourth, the "Other" category contains 370 cars. This is a sizable portion of the sample (much larger than the number of cars without prices), so any of the colors could move up or down after a closer look at the "Other" cars.

Thus, the current answer to the question is that a mix of neutrals (black and white at #1 and #3) and bright colors (orange and green at #2 and #4) are the most expensive car exterior colors, but that could change with deeper data analysis and/or a different sample.
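
Overlapping error bars are only a rough guide, so one possible follow-up is a two-sample test between a pair of colors. The sketch below uses Welch's t-test from SciPy, which was not part of the analysis above, on randomly generated prices; the group means, spreads, and sizes are assumptions chosen only to mimic one large and one small group.

```python
import numpy as np
from scipy import stats

# Invented price samples for two hypothetical color groups of very different size.
rng = np.random.default_rng(0)
black_prices = rng.normal(loc=40000, scale=15000, size=3000)
orange_prices = rng.normal(loc=39000, scale=15000, size=30)

# Welch's t-test does not assume equal variances or equal group sizes.
result = stats.ttest_ind(black_prices, orange_prices, equal_var=False)

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
```

With the real data, the two arrays would be the `Price` values for each color, for example `chart[chart['ExteriorColor'] == 'Black']['Price']`.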