## Exercises: Filter and query data

In [79]:
import pandas as pd
autos = pd.read_json("../Data/autos.json")

**Use the "autos" dataset to answer the following questions:**

**Q1:** How many cars from Jaguar are in the dataset, and what is the price of the most expensive one?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;count=3
&nbsp;&nbsp;&nbsp;price=36000.0
</details>

In [80]:
# autos[autos["make"] == "jaguar"]               # Filter the rows to get all Jaguar cars
# autos.query("make == 'jaguar'")                # Same filter, written with query

# autos[autos["make"] == "jaguar"].sort_values(by = ["price"], ascending = False)       # Sort by price in descending order

jaguar_cars = (autos["make"] == "jaguar").sum()    # Sum the boolean Series to count the matches (True counts as 1); sum() on the filtered DataFrame above would sum its columns instead
jaguar_cars

autos.query("make == 'jaguar'").sort_values(by = ["price"], ascending = False)        # Sort by price in descending order, using query

Unnamed: 0,aspiration,body-style,bore,city-mpg,compression-ratio,curb-weight,drive-wheels,engine-location,engine-size,engine-type,...,make,normalized-losses,num-of-cylinders,num-of-doors,peak-rpm,price,stroke,symboling,wheel-base,width
49,std,sedan,3.54,13,11.5,3950,rwd,front,326,ohcv,...,jaguar,,twelve,two,5000.0,36000.0,2.76,0,102.0,70.6
48,std,sedan,3.63,15,8.1,4066,rwd,front,258,dohc,...,jaguar,,six,four,4750.0,35550.0,4.17,0,113.0,69.6
47,std,sedan,3.63,15,8.1,4066,rwd,front,258,dohc,...,jaguar,145.0,six,four,4750.0,32250.0,4.17,0,113.0,69.6


**Q2:** How many cars from Toyota are in the dataset, and what is the price of the most expensive one?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;count=32
&nbsp;&nbsp;&nbsp;price=17669.0
</details>

In [81]:
toyota_cars = (autos["make"] == "toyota").sum()
toyota_cars

autos.query("make == 'toyota'").sort_values(by = ["price"], ascending = False).head(1)

Unnamed: 0,aspiration,body-style,bore,city-mpg,compression-ratio,curb-weight,drive-wheels,engine-location,engine-size,engine-type,...,make,normalized-losses,num-of-cylinders,num-of-doors,peak-rpm,price,stroke,symboling,wheel-base,width
172,std,convertible,3.62,24,9.3,2975,rwd,front,146,ohc,...,toyota,134.0,four,two,4800.0,17669.0,3.5,2,98.4,65.6


**Q3:** What is the length, width and height of the most expensive car in the entire dataset?

<details>
<summary>Answer</summary>
<br>&nbsp;&nbsp;&nbsp;length=199.2
&nbsp;&nbsp;&nbsp;width=72.0
&nbsp;&nbsp;&nbsp;height=55.4
</details>

In [82]:
# autos_sorted2 = autos.sort_values(by = ["price"], ascending = False).head(1)   # Alternative: keep only the top row, so length etc. can be read directly from this variable
autos_sorted = autos.sort_values(by = ["price"], ascending = False)

# Retrieve the first row (most expensive car)
most_expensive_car = autos_sorted.iloc[0]

# Extract the length, width, and height of the most expensive car
length = most_expensive_car["length"]
width = most_expensive_car["width"]
height = most_expensive_car["height"]


print("Length:", length)
print("Width:", width)
print("Height:", height)


# To build a DataFrame with the three values directly:

autos_sorted3 = autos.sort_values(by ="price", ascending = False)[["length", "width", "height"]]      # The column list can go directly at the end; no comma is needed because the rows are not being filtered with iloc/loc

autos_sorted3.head(1)


# Why you should not write autos_sorted3 = autos[autos[...]]

# autos.sort_values(by="price", ascending=False) sorts autos by the "price" column in
# descending order and returns a new DataFrame; call it autos_sorted.
# autos[autos_sorted[["length", "width", "height"]]] puts a DataFrame of numbers inside
# the outer brackets. pandas treats a DataFrame in that position as a boolean mask,
# not as a list of column names, so this does not select the three columns.

Length: 199.2
Width: 72.0
Height: 55.4


Unnamed: 0,length,width,height
74,199.2,72.0,55.4
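
The comments in the cell above contrast column selection with boolean indexing. A minimal sketch of the difference, using a small made-up DataFrame (`toy` and all of its values are assumptions for illustration, not rows from autos.json):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "price": [100.0, 300.0, 200.0],
    "length": [10.0, 30.0, 20.0],
    "width": [1.0, 3.0, 2.0],
})

# Column selection: a list of column names inside [] picks columns
dims_of_priciest = toy.sort_values(by="price", ascending=False)[["length", "width"]].head(1)

# Boolean indexing: a boolean Series inside [] picks rows
mask = toy["price"] > 150
expensive_rows = toy[mask]

print(dims_of_priciest)
print(expensive_rows)
```

The same brackets do two different jobs depending on what goes inside them, which is why a numeric DataFrame cannot stand in for a list of column names.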


**Q4:** What is the lowest price per horsepower in the dataset, and what brand ("make") is that car?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;price per horsepower=72.84
&nbsp;&nbsp;&nbsp;brand=Toyota
</details>

In [83]:
autos.head()

autos[["horsepower", "price"]]     # Create a DataFrame with only the horsepower and price columns
autos["horsepower"]                # Single brackets with one column name return a Series (1D)
autos[["horsepower"]]              # Double brackets (a list of column names) return a DataFrame (2D), even for one column

autos["price_per_horsepower"] = autos["price"]/autos["horsepower"]     # Element-wise division, computed for every row

autos.sort_values(by = "price_per_horsepower").head(1)[["make", "price_per_horsepower"]]   # Sort the rows, take the top row, then select the columns

# autos.info()   # Shows that some columns contain NaN values

# autos.loc[autos['price_per_horsepower'].isna(), ["price", "horsepower"]]   # To see which inputs produced the NaN values: keep the rows where
# price_per_horsepower is NaN and show the price and horsepower columns, since the ratio is computed from them

# As above but sorting for one feature in the rows:
# autos[autos["make"] == "volvo"].sort_values(by = "price_per_horsepower")[["make", "price_per_horsepower"]]

# For info and education about filtering rows and columns
# autos.loc[autos['make'] == 'toyota', ['make', 'price_per_horsepower']]   # Left part of the comma is filtering the rows to include only those where make == toyota, 
                                                                         # and the right side of the comma selects the columns



Unnamed: 0,make,price_per_horsepower
167,toyota,72.836207
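
As an alternative to sorting, the row with the lowest ratio can be located with `idxmin`, which skips NaN values by default. This is only a sketch on invented data; the `toy` DataFrame and its numbers are assumptions, not values from autos.json:

```python
import pandas as pd

# Toy data, invented for illustration; the missing price produces a NaN ratio
toy = pd.DataFrame({
    "make": ["a", "b", "c"],
    "price": [1000.0, 900.0, None],
    "horsepower": [10.0, 6.0, 5.0],
})

toy["price_per_horsepower"] = toy["price"] / toy["horsepower"]

# idxmin returns the index label of the smallest non-NaN value
best_label = toy["price_per_horsepower"].idxmin()

# .loc with a row label and a list of columns returns that row as a Series
best = toy.loc[best_label, ["make", "price_per_horsepower"]]
print(best["make"], round(best["price_per_horsepower"], 2))
```

Sorting and taking `head(1)` gives the same row here; `idxmin` simply avoids sorting the whole DataFrame when only one row is needed.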


**Q5:** How many of the cars in the dataset have as many cylinders as they have doors?

<details>
<summary>Answer</summary>
<br>
&nbsp;&nbsp;&nbsp;cars=95
</details>

In [84]:
autos[autos["num-of-cylinders"] == autos["num-of-doors"]]      # Creates a DataFrame with only the rows where the number of cylinders equals the number of doors
                                                               # (boolean indexing)
# In contrast to:
# autos["price"]    # this is column indexing
# The first expression uses boolean indexing: the comparison returns a Series of True/False values,
# which is then used to index the original DataFrame, keeping only the rows marked True

# autos[autos["num-of-cylinders"] == autos["num-of-doors"]].sum()   # Note: here sum is applied to the whole filtered DataFrame and returns a Series of column sums

count_of_true_rows = (autos["num-of-cylinders"] == autos["num-of-doors"]).sum()    # Sum the boolean Series itself to get the number of True values

count_of_true_rows



95
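
The count above relies on `True` being treated as 1 when a boolean Series is summed. A small sketch on made-up rows (the `toy` values are assumptions for illustration) shows that this agrees with counting the rows of the filtered DataFrame:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "num-of-cylinders": ["four", "six", "four", "two"],
    "num-of-doors": ["four", "four", "two", "two"],
})

# Element-wise comparison gives a boolean Series, one value per row
mask = toy["num-of-cylinders"] == toy["num-of-doors"]

count_via_sum = int(mask.sum())   # True counts as 1, False as 0
count_via_len = len(toy[mask])    # number of rows that pass the filter

print(count_via_sum, count_via_len)
```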

**Use the "autos" dataset and write python code to solve the following tasks:**

**T1:** Print a string to inform the user of the price difference between the cheapest and the most expensive car in the dataset.

<details>
<summary>Solution</summary>
<br>
&nbsp;&nbsp;&nbsp;<b>Example:</b><br>
&nbsp;&nbsp;&nbsp;The price difference between the cheapest and the most expensive car is 40282.0
</details>

In [92]:
cheapest_car = autos.sort_values(by = "price").head(1)["price"]
most_expensive_car = autos.sort_values(by = "price", ascending = False).head(1)["price"]

# cheapest_car.iloc[0]    # Extract the scalar from the one-row Series; subtracting the two Series directly would align on their different index labels and give NaN
# cheapest_car = autos.sort_values(by = "price").head(1)["price"].iloc[0]    # This is also possible


print(f"The difference between the most expensive and the cheapest car is {most_expensive_car.iloc[0]- cheapest_car.iloc[0]} USD")




The difference between the most expensive and the cheapest car is 40282.0 USD
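
A shorter route to the same number is to take `max()` and `min()` of the price column, both of which ignore NaN by default. The sketch below runs on invented prices (the `toy` DataFrame is an assumption, not the autos data):

```python
import pandas as pd

# Toy prices, invented for illustration; None becomes NaN and is skipped
toy = pd.DataFrame({"price": [5000.0, None, 45000.0, 12000.0]})

# max() and min() return scalars, so no .iloc[0] is needed
price_difference = float(toy["price"].max() - toy["price"].min())

message = (
    "The price difference between the cheapest and the most expensive car "
    f"is {price_difference}"
)
print(message)
```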


**T2:** Ask the user to input a brand, then print the price range for that brand.

<details>
<summary>Solution</summary>
<br>
&nbsp;&nbsp;&nbsp;<b>Example 1:</b><br>
&nbsp;&nbsp;&nbsp;Input the name of a brand: volvo<br>
&nbsp;&nbsp;&nbsp;The prices for cars of brand 'volvo' range from 12940.0 to 22625.0<br>
<br>
&nbsp;&nbsp;&nbsp;<b>Example 2:</b><br>
&nbsp;&nbsp;&nbsp;Input the name of a brand: toyota<br>
&nbsp;&nbsp;&nbsp;The prices for cars of brand 'toyota' range from 5348.0 to 17669.0<br>
<br>
&nbsp;&nbsp;&nbsp;<b>Example 3:</b><br>
&nbsp;&nbsp;&nbsp;Input the name of a brand: tesla<br>
&nbsp;&nbsp;&nbsp;The brand 'tesla' does not exist in the dataset.<br>
</details>
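
One possible approach to T2 is sketched below. The helper name `price_range_message` and the `toy` DataFrame are hypothetical and exist only for this illustration; in the notebook, `autos` would be passed in and the brand would come from `input()`.

```python
import pandas as pd


def price_range_message(df, brand):
    """Return the T2 message for `brand` (hypothetical helper, not part of the exercise)."""
    brand_cars = df[df["make"] == brand]   # boolean indexing on the make column
    if brand_cars.empty:
        return f"The brand '{brand}' does not exist in the dataset."
    lowest = float(brand_cars["price"].min())    # min/max skip NaN prices
    highest = float(brand_cars["price"].max())
    return f"The prices for cars of brand '{brand}' range from {lowest} to {highest}"


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "make": ["volvo", "volvo", "toyota"],
    "price": [100.0, 250.0, 50.0],
})

# In the notebook: brand = input("Input the name of a brand: ")
found = price_range_message(toy, "volvo")
missing = price_range_message(toy, "tesla")
print(found)
print(missing)
```

Checking `brand_cars.empty` before calling `min()` and `max()` keeps the "unknown brand" case separate from the normal one.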

**T3:** Ask the user to input a brand, then print the number of cars in the dataset for that brand, and all attributes for a random sample car of that brand.

<details>
<summary>Solution</summary>
<br>
&nbsp;&nbsp;&nbsp;<b>Example:</b><br>
&nbsp;&nbsp;&nbsp;Input the name of a brand: mazda<br>
&nbsp;&nbsp;&nbsp;There are 17 cars of brand 'mazda' in the dataset.<br><br>
&nbsp;&nbsp;&nbsp;Here is the data for a random 'mazda' car:<br>
&nbsp;&nbsp;&nbsp;aspiration = std<br>
&nbsp;&nbsp;&nbsp;body-style = sedan<br>
&nbsp;&nbsp;&nbsp;bore = 3.03<br>
&nbsp;&nbsp;&nbsp;city-mpg = 31<br>
&nbsp;&nbsp;&nbsp;compression-ratio = 9.0<br>
&nbsp;&nbsp;&nbsp;curb-weight = 1945<br>
&nbsp;&nbsp;&nbsp;drive-wheels = fwd<br>
&nbsp;&nbsp;&nbsp;engine-location = front<br>
&nbsp;&nbsp;&nbsp;engine-size = 91<br>
&nbsp;&nbsp;&nbsp;engine-type = ohc<br>
&nbsp;&nbsp;&nbsp;fuel-system = 2bbl<br>
&nbsp;&nbsp;&nbsp;fuel-type = gas<br>
&nbsp;&nbsp;&nbsp;height = 54.1<br>
&nbsp;&nbsp;&nbsp;highway-mpg = 38<br>
&nbsp;&nbsp;&nbsp;horsepower = 68.0<br>
&nbsp;&nbsp;&nbsp;length = 166.8<br>
&nbsp;&nbsp;&nbsp;make = mazda<br>
&nbsp;&nbsp;&nbsp;normalized-losses = 113.0<br>
&nbsp;&nbsp;&nbsp;num-of-cylinders = four<br>
&nbsp;&nbsp;&nbsp;num-of-doors = four<br>
&nbsp;&nbsp;&nbsp;peak-rpm = 5000.0<br>
&nbsp;&nbsp;&nbsp;price = 6695.0<br>
&nbsp;&nbsp;&nbsp;stroke = 3.15<br>
&nbsp;&nbsp;&nbsp;symboling = 1<br>
&nbsp;&nbsp;&nbsp;wheel-base = 93.1<br>
&nbsp;&nbsp;&nbsp;width = 64.2<br>
</details>
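
A possible sketch for T3 uses `DataFrame.sample` to pick one random row and `Series.items()` to walk through its attributes. The helper `describe_random_car`, its `seed` parameter, and the `toy` data are assumptions made for this example.

```python
import pandas as pd


def describe_random_car(df, brand, seed=None):
    """Return the T3 output lines for `brand` (hypothetical helper)."""
    brand_cars = df[df["make"] == brand]
    if brand_cars.empty:
        return [f"The brand '{brand}' does not exist in the dataset."]

    lines = [
        f"There are {len(brand_cars)} cars of brand '{brand}' in the dataset.",
        "",
        f"Here is the data for a random '{brand}' car:",
    ]
    # sample(n=1) returns a one-row DataFrame; iloc[0] turns it into a Series
    car = brand_cars.sample(n=1, random_state=seed).iloc[0]
    for name, value in car.items():
        lines.append(f"{name} = {value}")
    return lines


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "body-style": ["sedan", "hatchback", "wagon"],
    "make": ["mazda", "mazda", "volvo"],
    "price": [6000.0, 7000.0, 13000.0],
})

# In the notebook: brand = input("Input the name of a brand: ")
lines = describe_random_car(toy, "mazda", seed=0)
print("\n".join(lines))
```

Passing a fixed `seed` makes the sample repeatable; leaving it as `None` gives a different car on each run.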

**T4:** Ask the user to input a brand, then export all cars of that brand to a .csv file with the same name as the brand.

<details>
<summary>Solution</summary>
<br>
&nbsp;&nbsp;&nbsp;<b>Example 1:</b><br>
&nbsp;&nbsp;&nbsp;Input the name of a brand: volkswagen<br>
&nbsp;&nbsp;&nbsp;Exported 12 cars to 'volkswagen.csv'<br>
<br>
&nbsp;&nbsp;&nbsp;<b>Example 2:</b><br>
&nbsp;&nbsp;&nbsp;Input the name of a brand: tesla<br>
&nbsp;&nbsp;&nbsp;The brand 'tesla' does not exist in the dataset.<br>
</details>
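
For T4, one way to proceed is to filter by brand and write the result with `DataFrame.to_csv`. The sketch below is an illustration under assumptions: the helper `export_brand`, its `directory` parameter, and the `toy` data are hypothetical, and the demo writes to a temporary folder so that nothing is left on disk.

```python
import os
import tempfile

import pandas as pd


def export_brand(df, brand, directory="."):
    """Write all cars of `brand` to '<brand>.csv' (hypothetical helper)."""
    brand_cars = df[df["make"] == brand]
    if brand_cars.empty:
        return f"The brand '{brand}' does not exist in the dataset."
    path = os.path.join(directory, f"{brand}.csv")
    brand_cars.to_csv(path, index=False)   # index=False leaves out the row labels
    return f"Exported {len(brand_cars)} cars to '{brand}.csv'"


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "make": ["volkswagen", "volkswagen", "mazda"],
    "price": [8000.0, 9000.0, 6000.0],
})

# In the notebook: brand = input("Input the name of a brand: ")
with tempfile.TemporaryDirectory() as tmp:
    message = export_brand(toy, "volkswagen", directory=tmp)
    exported = pd.read_csv(os.path.join(tmp, "volkswagen.csv"))

missing = export_brand(toy, "tesla")
print(message)
print(missing)
```

Whether to keep the index in the file is a design choice; `index=False` is used here so the CSV contains only the original columns.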