Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ArityException with cookbook sample 4 #315

Open
1 task done
jriekhof opened this issue Feb 24, 2021 · 5 comments
Open
1 task done

ArityException with cookbook sample 4 #315

jriekhof opened this issue Feb 24, 2021 · 5 comments

Comments

@jriekhof
Copy link

jriekhof commented Feb 24, 2021

  • I have read through the quick start and installation sections of the README.

Info

Info Value
Operating System macOS Big Sur 11.2.1
Geni Version 0.0.38
Spark Java lib Version org.apache.spark/spark-XXX_2.12 "3.1.0"
JDK openjdk version "1.8.0_282"
Spark Version 3.0.2

Problem / Steps to reproduce

Standard lein new geni …, bitnami/spark 3.0.2 docker, then used code from geni cookbook chapter 4.

The following code from cookbook example 4 fails with ArityException:

(def null-counts
    (-> raw-weather-mar-2012
        (g/agg (->> (g/column-names raw-weather-mar-2012)
                    (map #(vector % (g/null-count %)))
                    (into {})))
        (g/first)))

Exception is

Execution error (AnalysisException) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt/failAnalysis (package.scala:42).
cannot resolve '`Precip. Amount (mm)`' given input columns: [Climate ID, Date/Time (LST), Day, Dew Point Temp (°C), Dew Point Temp Flag, Hmdx, Hmdx Flag, Latitude (y), Longitude (x), Month, Precip. Amount (mm), Precip. Amount Flag, Rel Hum (%), Rel Hum Flag, Station Name, Stn Press (kPa), Stn Press Flag, Temp (°C), Temp Flag, Time (LST), Visibility (km), Visibility Flag, Weather, Wind Chill, Wind Chill Flag, Wind Dir (10s deg), Wind Dir Flag, Wind Spd (km/h), Wind Spd Flag, Year];
'Aggregate [sum(cast(cast(isnull(Latitude (y)#13272) as int) as bigint)) AS Latitude (y)#16363L, sum(cast(cast(isnull(Stn Press Flag#13295) as int) as bigint)) AS Stn Press Flag#16364L, sum(cast(cast(isnull(Temp (°C)#13280) as int) as bigint)) AS Temp (°C)#16365L, sum(cast(cast(isnull(Wind Spd (km/h)#13290) as int) as bigint)) AS Wind Spd (km/h)#16366L, sum(cast(cast(isnull(Rel Hum Flag#13285) as int) as bigint)) AS Rel Hum Flag#16367L, sum(cast(cast(isnull(Date/Time (LST)#13275) as int) as bigint)) AS Date/Time (LST)#16368L, sum(cast(cast(isnull(Visibility Flag#13293) as int) as bigint)) AS Visibility Flag#16369L, sum(cast(cast(isnull(Visibility (km)#13292) as int) as bigint)) AS Visibility (km)#16370L, sum(cast(isnull('Precip. Amount (mm)) as int)) AS Precip. Amount (mm)#16371, sum(cast(cast(isnull(Dew Point Temp Flag#13283) as int) as bigint)) AS Dew Point Temp Flag#16372L, sum(cast(isnull('Precip. Amount Flag) as int)) AS Precip. Amount Flag#16373, sum(cast(cast(isnull(Station Name#13273) as int) as bigint)) AS Station Name#16374L, sum(cast(cast(isnull(Wind Chill Flag#13299) as int) as bigint)) AS Wind Chill Flag#16375L, sum(cast(cast(isnull(Longitude (x)#13271) as int) as bigint)) AS Longitude (x)#16376L, sum(cast(cast(isnull(Time (LST)#13279) as int) as bigint)) AS Time (LST)#16377L, sum(cast(cast(isnull(Dew Point Temp (°C)#13282) as int) as bigint)) AS Dew Point Temp (°C)#16378L, sum(cast(cast(isnull(Rel Hum (%)#13284) as int) as bigint)) AS Rel Hum (%)#16379L, sum(cast(cast(isnull(Wind Dir Flag#13289) as int) as bigint)) AS Wind Dir Flag#16380L, sum(cast(cast(isnull(Climate ID#13274) as int) as bigint)) AS Climate ID#16381L, sum(cast(cast(isnull(Wind Dir (10s deg)#13288) as int) as bigint)) AS Wind Dir (10s deg)#16382L, sum(cast(cast(isnull(Stn Press (kPa)#13294) as int) as bigint)) AS Stn Press (kPa)#16383L, sum(cast(cast(isnull(Year#13276) as int) as bigint)) AS Year#16384L, sum(cast(cast(isnull(Temp Flag#13281) as int) as bigint)) AS Temp Flag#16385L, sum(cast(cast(isnull(Hmdx#13296) as int) as bigint)) AS Hmdx#16386L, ... 6 more fields]
+- Project [Longitude (x)#13211 AS Longitude (x)#13271, Latitude (y)#13212 AS Latitude (y)#13272, Station Name#13213 AS Station Name#13273, Climate ID#13214 AS Climate ID#13274, Date/Time (LST)#13215 AS Date/Time (LST)#13275, Year#13216 AS Year#13276, Month#13217 AS Month#13277, Day#13218 AS Day#13278, Time (LST)#13219 AS Time (LST)#13279, Temp (°C)#13220 AS Temp (°C)#13280, Temp Flag#13221 AS Temp Flag#13281, Dew Point Temp (°C)#13222 AS Dew Point Temp (°C)#13282, Dew Point Temp Flag#13223 AS Dew Point Temp Flag#13283, Rel Hum (%)#13224 AS Rel Hum (%)#13284, Rel Hum Flag#13225 AS Rel Hum Flag#13285, Precip. Amount (mm)#13226 AS Precip. Amount (mm)#13286, Precip. Amount Flag#13227 AS Precip. Amount Flag#13287, Wind Dir (10s deg)#13228 AS Wind Dir (10s deg)#13288, Wind Dir Flag#13229 AS Wind Dir Flag#13289, Wind Spd (km/h)#13230 AS Wind Spd (km/h)#13290, Wind Spd Flag#13231 AS Wind Spd Flag#13291, Visibility (km)#13232 AS Visibility (km)#13292, Visibility Flag#13233 AS Visibility Flag#13293, Stn Press (kPa)#13234 AS Stn Press (kPa)#13294, ... 6 more fields]
   +- Relation[Longitude (x)#13211,Latitude (y)#13212,Station Name#13213,Climate ID#13214,Date/Time (LST)#13215,Year#13216,Month#13217,Day#13218,Time (LST)#13219,Temp (°C)#13220,Temp Flag#13221,Dew Point Temp (°C)#13222,Dew Point Temp Flag#13223,Rel Hum (%)#13224,Rel Hum Flag#13225,Precip. Amount (mm)#13226,Precip. Amount Flag#13227,Wind Dir (10s deg)#13228,Wind Dir Flag#13229,Wind Spd (km/h)#13230,Wind Spd Flag#13231,Visibility (km)#13232,Visibility Flag#13233,Stn Press (kPa)#13234,... 6 more fields] csv

Also manual select with column named "Precip. Amount (mm)" does not work. It seems that they have backticks around them internally.

I tried to rename all columns with

(g/to-df weather-data
               "Longitude (x)" "Latitude (y)" "Station Name" "Climate ID" "Date/Time (LST)" "Year" "Month" "Day"
               "Time (LST)" "Temp (°C)" "Temp Flag" "Dew Point Temp (°C)" "Dew Point Temp Flag" "Rel Hum (%)"
               "Rel Hum Flag" "Precip. Amount (mm)" "Precip. Amount Flag" "Wind Dir (10s deg)" "Wind Dir Flag"
               "Wind Spd (km/h)" "Wind Spd Flag" "Visibility (km)" "Visibility Flag" "Stn Press (kPa)" "Stn Press Flag"
               "Hmdx" "Hmdx Flag" "Wind Chill" "Wind Chill Flag" "Weather"))

but the problem persists, still backticks.
Crashes: (g/select raw-weather-mar-2012 "Precip. Amount (mm)")
Works: (g/select raw-weather-mar-2012 "`Precip. Amount (mm)`")

(g/column-names (g/select raw-weather-mar-2012 "`Precip. Amount (mm)`"))
yields "Precip. Amount (mm)" without backticks.

This lead me to believe that there is some issue in geni or spark with these column names.

@anthony-khong
Copy link
Member

Huh... That's very odd... Thank you for flagging this up! I've added this to my TODO list. I think I'll have time to look into it next week. I hope that's okay!

@jriekhof
Copy link
Author

jriekhof commented Feb 26, 2021 via email

@anthony-khong
Copy link
Member

Hi Jochen, thank you for the kind words!! Glad you’re enjoying it. Please let me know if you have any feature requests.

On this issue, I think this is related: https://mungingdata.com/pyspark/avoid-dots-periods-column-names/

Basically Spark doesn’t like column names with dots. One thing we can do is to auto-escape it, but I’m not sure if this is the best solution, because you lose the Spark correspondence. I’m leaning towards having a ‘safe-mode’ that is on by default in the read functions that basically scans the column names for dots, and replace it with underscores 🤔

@jriekhof
Copy link
Author

jriekhof commented Feb 26, 2021 via email

@jriekhof
Copy link
Author

jriekhof commented Mar 8, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants